Remarks

Reconsideration of this Application is respectfully requested. 

Claims 38-42, 45-49 and 51-57 are pending in the application, with claims 38 and 
51 being the independent claims. Claims 1-37, 43, 44 and 50 are cancelled. Claims 38, 
47, 48, 49 and 51 have been amended. Support for the amendments to claims 38 and 51
may be found throughout the specification. Support for the amendments to claims
47-49 may be found throughout the specification, e.g., in paragraph [0072] on page 23.

Based on the following remarks, Applicants respectfully request that the 
Examiner reconsider all outstanding rejections and that they be withdrawn. 

Examiner Interview 

On June 6, 2007, the Examiner initiated a telephonic interview with Applicants'
representatives. The Examiner suggested additional amendments to the claims to place
the claims in condition for allowance. Applicants thank the Examiner for suggesting the 
claim amendments and have amended the claims in accordance with the Examiner's 
suggestions. 

The Examiner further requested that Applicants submit copies of Declarations 
Under 37 C.F.R. § 1.132 by En Li, Ph.D., an inventor, and Kenneth D. Bloch, M.D. that 
were filed in parent Appl. No. 09/720,086 in support of Applicants' priority claim. 
Copies of these documents are submitted herewith. 

Accordingly, Applicants respectfully request that the Examiner reconsider and 
withdraw all outstanding objections and rejections. 
Conclusion

All of the stated grounds of objection and rejection have been properly traversed, 
accommodated, or rendered moot. Applicants therefore respectfully request that the 
Examiner reconsider all presently outstanding objections and rejections and that they be 
withdrawn. Applicants believe that a full and complete reply has been made to the 
outstanding Office Action and, as such, the present application is in condition for 
allowance. If the Examiner believes, for any reason, that personal communication will 
expedite prosecution of this application, the Examiner is invited to telephone the 
undersigned directly at (202) 772-8658. 

Prompt and favorable consideration of this Reply is respectfully requested. 

Respectfully submitted, 
Sterne, Kessler, Goldstein & Fox P.L.L.C.



Daniel J. Nevrivy
Agent for Applicants 
Registration No. 59,118 
THE UNITED STATES PATENT AND TRADEMARK OFFICE

In re application of:

Confirmation No.: 6968
Art Unit: 1642
Appl. No. 09/720,086
102(e); July 23, 2001
Examiner: Harris, A. M.
Atty. Docket: 0609.4560002/KRM/DJN

For: De Novo DNA Cytosine Methyltransferase Genes, Polypeptides and Uses Thereof

Declaration Under 37 C.F.R. § 1.132 of En Li, Ph.D.

Commissioner for Patents 
P.O. Box 1450 
Alexandria, VA 22313-1450 



I, the undersigned, En Li, Ph.D., residing at 45 Hinckley Road,
Newton, Massachusetts, 02168, declare and state as follows:

1. I am a co-inventor of the above-captioned patent application.

2. I am currently employed by Novartis Institutes for Biomedical Research as 
Vice President & Global Head, Animal Models of Disease and Epigenetics Program. Prior 
to my current employment, I was an Associate Professor of Medicine at Harvard Medical 
School and directed a laboratory in the Cardiovascular Research Center at the Massachusetts 
General Hospital from January 1993 to April 2003, where I conducted and supervised 
research in the field of mouse genetics and developmental biology.

3. A current curriculum vitae is appended hereto as EXHIBIT A.

4. I have reviewed the above-captioned patent application and the Office Action
dated June 6, 2005. I have also reviewed the sequence listing as filed and the sequence
listing as amended on July 23, 2001. I have also reviewed the claims of the captioned patent
application.

5. I have been informed that the Examiner has not granted priority to the earlier
filed patent applications because there is insufficient proof that the coding regions of
currently amended SEQ ID NOS:1 and 2 are the same as those listed in the priority
documents, viz., the mouse Dnmt3a and Dnmt3b cDNA clones encoding the coding regions
of SEQ ID NOS:1 and 2, respectively, that were deposited with the American Type Culture
Collection (ATCC), 10801 University Boulevard, Manassas, Virginia 20110-2209, USA.
Sequences harboring the coding regions of SEQ ID NOS:1 and 2, respectively, were
deposited with the ATCC on June 16, 1998, and assigned ATCC Deposit Nos. 209933 and
209934, respectively. The deposit date of June 16, 1998 was prior to the filing date of the
first provisional application, App. No. 60/090,906, filed June 25, 1998, the benefit of which
is claimed. The '906 application includes the sequence information and references the
deposits of the sequenced material on page 15, line 26, through page 16, line 2, of the
specification.

6. In November 2004, Applicants had samples withdrawn of the mouse Dnmt3a
and Dnmt3b cDNA clones contained within ATCC Deposit Nos. 209933 and 209934,
respectively. At the Applicants' request, Kenneth D. Bloch, M.D., a faculty member in the
Cardiovascular Research Center at the Massachusetts General Hospital and an experienced
DNA sequencer, sequenced nucleotides that spanned the coding regions of mouse Dnmt3a
and Dnmt3b in the deposited cDNA. A nucleotide alignment that spans the coding regions
of sequenced mouse Dnmt3a cDNA clone contained in ATCC Deposit No. 209933 and
currently amended SEQ ID NO:1 is shown in Exhibit B. A nucleotide alignment that spans
the coding regions of sequenced mouse Dnmt3b cDNA clone contained in ATCC Deposit
No. 209934 and currently amended SEQ ID NO:2 is shown in Exhibit C.

7. The amendment to the sequence listing, which was filed on July 23, 2001,
corrected six nucleotides in the coding sequence of SEQ ID NO:1 (see the bolded
nucleotides at positions 516, 843, 1036, 1110, 1116 and 1726 in EXHIBIT B) and two
nucleotides in the coding sequence of SEQ ID NO:2 (see the bolded nucleotides at positions
918 and 920 in EXHIBIT C).

8. The deposited clones recited in ¶¶ 5 and 6, above (i.e., ATCC Deposit Nos.
209933 and 209934) are the same as the deposited clones recited in the above-captioned
application. The coding sequence of ATCC Deposit No. 209933 is currently believed to be
the same as the coding sequence of currently amended SEQ ID NO:1. The coding sequence
of ATCC Deposit No. 209934 is currently believed to be the same as the coding sequence of
currently amended SEQ ID NO:2.

9. It is well known that sequencing errors are a common problem in Molecular
Biology. See, e.g., Peter Richterich, Estimation of Errors in 'Raw' DNA Sequences: A
Validation Study, 8 Genome Research 251-59 (1998) (EXHIBIT D). I believe that one skilled
in the art would have sequenced the deposited material and recognized the sequencing errors
in the coding region. I believe that the correct mouse Dnmt3a and Dnmt3b coding
sequences are inherent to the ATCC deposited clones, ATCC Deposit Nos. 209933 and
209934, respectively, which were deposited prior to the filing of App. No. 60/090,906, filed
June 25, 1998, the benefit of which is claimed.

10. Accordingly, based on the above, I believe that Applicants are entitled to the
June 25, 1998 filing date for the coding sequences of mouse Dnmt3a and Dnmt3b contained
within ATCC Deposit Nos. 209933 and 209934, respectively.

11. I hereby declare that all statements made herein of my own knowledge are
true and that all statements made on information and belief are believed to be true; and
further that these statements were made with the knowledge that willful false statements and
the like so made are punishable by fine or imprisonment, or both, under Section 1001 of
Title 18 of the United States Code and that such willful false statements may jeopardize the
validity of the present patent application or any patent issued thereon.



Respectfully submitted, 




En Li, Ph.D.

Date: [illegible]
CURRICULUM VITAE



En Li, Ph.D.

Vice President & Global Head 
Animal Models of Disease 
Novartis Institute for Biomedical Research, Inc. 
250 Mass Ave. Cambridge, MA 02139 
Tel: 617-871-7072 
Fax: 617-871-7263
Email: en.li@novartis.com



Education: 

1984 B.Sc. in Biochemistry. Peking University, Beijing, China 

1992 Ph.D. Massachusetts Institute of Technology (Biology) 

(Advisor: Prof. Rudolf Jaenisch) 

Professional Experience: 

1993-1996 Principal Investigator, Cardiovascular Research Center, Massachusetts General
          Hospital
          Instructor, Department of Medicine, Harvard Medical School
1996-2000 Assistant Professor, Department of Medicine, Harvard Medical School
2000-2003 Associate Professor, Department of Medicine, Harvard Medical School
2001-2003 Guest Professor, Beijing University, Health Science Center
2003-     VP & Global Head, Models of Disease Center, Epigenetics Program,
          Novartis Institute for Biomedical Research

Review and editorial board 

1999- External Grant Reviewer, Human Frontier Science Program
1999- External Grant Reviewer, NIH, NICHD
1999- Member of the Advisory Board, Journal of Biochemistry
2000- External Grant Reviewer, NIH, NIA
2000- External Grant Reviewer, NSF
2000- Mail Reviewer, Wellcome Trust
2003- Member of the Advisory Board, China Science Reports
2004  External Grant Reviewer, Chinese Natural Science Foundation

1993- Ad Hoc Reviewer for the following journals 

Nature, Science, Cell, Nat Genet, Genes Dev, Trends Genet, Development, Mol Cell 
Biol, PNAS, Human Mol. Genet., Dev. Biol., Mech. Dev., J. Cell Biol., Dev. Dyn., 
Gene, Genomics, Nucl Acid Res, Mammalian Genome, etc. 

Professional Societies 

1990- American Association for the Advancement of Science, Member
1994- DNA Methylation Society, Member
1999- Ray Wu Society, Member



Invited Presentations (Since 2003)

2003 Invited speaker, Gordon Research Conference - Cancer Genetics and Epigenetics
2003 Session chair, Keystone Symposium - Chromatin (Big Sky, Montana)
2003 Invited speaker, Annual HUGO meeting at Cancun, Mexico
2003 Invited speaker, Gordon Research Conference - Epigenetics
2004 Invited speaker, the 2nd Annual CDB symposium, Kobe, Japan
2004 Invited speaker, 2nd Weissenburg Symposium on DNA methylation - an important genetics
     signals. Weissenburg, Germany
2004 Session Chair, on genomic imprinting. 10th SCBA International Symposiums, Beijing, China
2004 Invited speaker, Genomic imprinting workshop in Montpellier, France
2005 Vice Chair, Gordon Research Conference 'Cancer Genetics and Epigenetics'
Exhibit B

Alignment spanning the coding region of mouse Dnmt3a sequence from ATCC
Deposit No. 209933 (top) and currently amended SEQ ID NO:1 (bottom) 1

atgccctccagcggccccggggacaccagcagctcctctctggagcgggaggatgatcga 

II 1 1 IN I Mill II I MINI II I III Ml I MINI IIIIIIM I lllllll I Mill 

atgccctccagcggccccggggacaccagcagctcctctctggagcgggaggatgatcga 276 
aaggaaggagaggaacaggaggagaaccgtggcaaggaagagcgccaggagcccagcgcc 

1 1 1 1 1 1 1 1 1 1 1 1 1 M 1 1 1 1 1 1 1 1 1 1 1 1 Ml 1 1 1 II 1 1 1 1 II 1 1 ! II M 1 1 1 1 1 II 1 1 1 1 1 

aaggaaggagaggaacaggaggagaaccgtggcaaggaagagcgccaggagcccagcgcc 3 3 6 
acggcccggaaggtggggaggcctggccggaagcgcaagcacccaccggtggaaagcagt 

1 1 1 1 1 1 1 ! 1 1 1 1 1 1 1 1 i 1 1 1 1 M I i 1 1 1 f 1 1 1 1 1 1 1 1 1 M 1 f 1 1 1 1 M 1 1 1 1 1 1 1 1 1 1 1 1 

acggcccggaaggtggggaggcctggccggaagcgcaagcacccaccggtggaaagcagt 3 96 
gacacccccaaggacccagcagtgaccaccaagtctcagcccatggcccaggactctggc 

Mill 1 1 1 1 ! 1 1 1 II I lllllll I lllllll I MINI I lllllll lllllll MM Ml 

gacacccccaaggacccagcagtgaccaccaagtctcagcccatggcccaggactctggc 4 56 
ccctcagatctgctacccaatggagacttggagaagcggagtgaaccccaacctgaggag 

1 1 1 II 1 1 i 1 1 1 1 1 E 1 1 1 1 1 1 1 1 1 1 1 4 1 r 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ! 1 1 

ccctcagatctgctacccaatggagacttggagaagcggagtgaaccccaacctgaggag 516 
gggagcccagctgcagggcagaagggtggggccccagctgaaggagagggaactgagacc 

1 1 II 1 1 1 1 1 1 1 1 II 1 1 1 1 MM 1 1 1 1 II II 1 1 III I II I Ml II II 1 1 1 1 INI 1 1 II II 

gggagcccagctgcagggcagaagggtggggccccagctgaaggagagggaactgagacc 576 
ccaccagaagcctccagagctgtggagaatggctgctgtgtgaccaaggaaggccgtgga 

1 1 1 1 e 1 1 1 i 1 1 1 J 1 1 1 1 1 1 ) r 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 

ccaccagaagcctccagagctgtggagaatggctgctgtgtgaccaaggaaggccgtgga 63 6 
gcctctgcaggagagggcaaagaacagaagcagaccaacatcgaatccatgaaaatggag 

1 1 1 1 1 1 1 M 1 1 1 1 1 1 1 1 1 1 1 1 1 1 M 1 1 1 i 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 M 1 1 

gcctctgcaggagagggcaaagaacagaagcagaccaacatcgaatccatgaaaatggag 6 96 
ggctcccggggccgactgcgaggtggcttgggctgggagtccagcctccgtcagcgaccc 

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 f 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 

ggctcccggggccgactgcgaggtggcttgggctgggagtccagcctccgtcagcgaccc 756 
atgccaagactcaccttccaggcaggggacccctactacatcagcaaacggaaacgggat 

MIMIIIIIMMIIIIIIIIIIIIIIIMIIIIIIIIIIIIMIIIIIIIIIIMIII 

atgccaagactcaccttccaggcaggggacccctactacatcagcaaaeggaaacgggat 816 
gagtggctggcacgttggaaaagggaggctgagaagaaagccaaggtaattgcagtaatg 

I MM II MMMI IIIIIIM I MMIII IMIIIII llllll MM MM I lllllll 

gagtggctggcacgttggaaaagggaggctgagaagaaagccaaggtaattgcagtaatg 876 
aatgctgtggaagagaaccaggcctctggagagtctcagaaggtggaggaggccagccct 

lllllll MMMIMMMMMMMMMMMMMMIMMMIMMMIIII 

aatgctgtggaagagaaccaggcctctggagagtctcagaaggtggaggaggccagccct 93 6 
cctgctgtgcagcagcccacggaccctgcttctccgactgtggccaccacccctgagcca 

llllll I Mill II III MUM lllllll llllll IMIIIII IIIIIIM I llllll I 

cctgctgtgcagcagcccacggaccctgcttctccgactgtggccaccacccctgagcca 996 



1 Bolded nucleotides indicate nucleotides that were amended on July 23, 2001.
gtaggaggggatgctggggacaagaatgctaccaaagcagccgacgatgagcctgagtat

MM Mill MINN Ml Ml MINN Mlllll I MINN MINI 1 1 MINI M 

gtaggaggggatgctggggacaagaatgctaccaaagcagccgacgatgagcctgagtat 105 6 
gaggatggccggggctttggcattggagagctggtgtgggggaaacttcggggcttctcc 

1 1 1 1 1 1 i 1 1 1 1 1 1 M 1 1 1 1 1 1 1 1 1 1 1 1 II 1 1 1 1 II 1 1 1 1 1 1 1 1 1 1 1 1 1 M 1 1 II 1 1 1 1 1 1 

gaggatggccggggctttggcattggagagctggtgtgggggaaacttcggggcttctcc 1116 
tggtggccaggccgaattgtgtcttggtggatgacaggccggagccgagcagctgaaggc 
tggtggccaggccgaattgtgtcttggtggatgacaggccggagccgagcagctgaaggc 117 6 
actcgctgggtcatgtggttcggagatggcaagttctcagtggtgtgtgtggagaagctc 

MIMI II IMIM Mlllll I II IMMMIIMI I MM I II Ml MM MINI I II 

actcgctgggtcatgtggttcggagatggcaagttctcagtggtgtgtgtggagaagctc 12 3 6 
atgccgctgagctccttctgcagtgcattccaccaggccacctacaacaagcagcccatg 

1 1 1 M 1 1 1 1 1 ! 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 j 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 

atgccgctgagctccttctgcagtgcattccaccaggccacctacaacaagcagcccatg 12 96 
taccgcaaagccatctacgaagtcctccaggtggccagcagccgtgccqqqaaqctqttt 

M II II 1 1 1 1 II 1 1 1 MM II MINI II MIMI I IMIM Mill II I IMIM I Ml 

taccgcaaagccatctacgaagtcctccaggtggccagcagccgtgccgggaagctgttt 13 56 
ccagcttgccatgacagtgatgaaagtgacagtggcaaggctgtggaagtgcagaacaaq 

1 1 1 1 M 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 f 1 1 1 1 1 1 1 j 1 1 ] 1 1 1 1 1 1 1 1 1 1 ! 1 1 1 M 1 1 1 1 1 1 1 j 1 1 1 

ccagcttgccatgacagtgatgaaagtgacagtggcaaggctgtggaagtgcagaacaag 1416 
caga.tgattgaatgggccctcggtggcttccagccctcgggtcctaagggcctggagcca 

1 1 1 1 1 M 1 1 1 1 1 1 1 1 1 1 1 1 II 1 1 1 1 1 1 1 1 1 1 1 1 M 1 1 1 1 1 1 M 1 1 1 1 1 M 1 1 1 1 M 1 1 1 1 

cagatgattgaatgggccctcggtggcttccagccctcgggtcctaagggcctggagcca 147 6 
ccagaagaagagaagaatccttacaaggaagtttacaccgacatgtgggtggagcctqaa 

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 E 1 1 1 1 1 1 1 1 1 ! 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 

ccagaagaagagaagaatccttacaaggaagtttacaccgacatgtgggtggagcctgaa 153 6 
gcagctgcttacgccccacccccaccagccaagaaacccagaaagagcacaacagagaaa 

1 1 1 r J 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 f i 1 1 1 1 1 1 r i f r 1 1 1 1 1 1 1 1 r 1 1 1 1 1 1 1 j r 1 1 1 

gcagctgcttacgccccacccccaccagccaagaaacccagaaagagcacaacagagaaa 15 96 
cctaaggtcaaggagatcattgatgagcgcacaagggagcggctggtgtatgaggtgcgc 

MM II II I III Mlllll I IMIM llllllll I Mill II Mlllll MIMI I MM 

cctaaggtcaaggagatcattgatgagcgcacaagggagcggctggtgtatgaggtgcgc 1656 
cagaagtgcagaaacatcgaggacatttgtatctcatgtgggagcctcaatgtcaccctq 

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 II II I II 1 1 MM I II I II 1 1 II I II I Ml 1 1 1 1 1 1 1 II 1 1 1 III 

cagaagtgcagaaacatcgaggacatttgtatctcatgtgggagcctcaatgtcaccctg 1716 
gagcacccactcttcattggaggcatgtgccagaactgtaagaactgcttcttggagtgt 

1 1 1 in 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 

gagcacccactcttcattggaggcatgtgccagaactgtaagaactgcttcttggagtgt 177 6 
gcttaccagtatgacgacgatgggtaccagtcctattgcaccatctgctgtggggggcgt 

i iiiiiii i iiiiiiii ii nil 

gcttaccagtatgacgacgatgggtaccagtcctattgcaccatctgctgtggggggcgt 18 36 
gaagtgctcatgtgtgggaacaacaactgctgcaggtgcttttgtgtcgagtgtgtggat 

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 

gaagtgctcatgtgtgggaacaacaactgctgcaggtgcttttgtgtcgagtgtgtggat 1896 



ctcttggtggggccaggagctgctcaggcagccattaaggaagacccctggaactgctac

IIIIMIIIIIIIIIIIIII IIIIIIIIIIIIIIIIIIIIIIIIIMIIIII IIMMII 

ctcttggtggggccaggagctgctcaggcagccattaaggaagacccctggaactgctac 1956 
atgtgcgggcataagggcacctatgggctgctgcgaagacgggaagactggccttctcga 

INN I Mill I ! I lllllll II 1 1 IIIMIIMI II IIMMII 1 1 1 MM Mill 1 1 1 

atgtgcgggcataagggcacctatgggctgctgcgaagacgggaagactggccttctcga 2 016 
ctccagatgttctttgccaataaccatgaccaggaatttgaccccccaaaggtttaccca 

1 1 1 M 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 E 1 1 i 1 1 1 1 1 1 1 1 II 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 M I 

ctccagatgttctttgccaataaccatgaccaggaatttgaccccccaaaggtttaccca 2 076 
cctgtgccagctgagaagaggaagcccatccgcgtgctgtctctctttgatgggattgct 

MM I MIIM II MINIMI 1 1 Mill Mill MUM Mill II Mill Mill II I 

cctgtgccagctgagaagaggaagcccatccgcgtgctgtctctctttgatgggattgct 213 6 
acagggctcctggtgctgaaggacctgggcatccaagtggaccgctacattgcctccgag 

1 1 1 1 1 1 1 1 1 II 1 1 1 1 1 1 1 1 II 1 1 1 1 1 1 1 1 1 1 i 1 1 1 M 1 1 1 f 1 1 1 1 1 1 m 1 1 1 M M 1 1 1 1 

acagggctcctggtgctgaaggacctgggcatccaagtggaccgctacattgcctccgag 2196 
gtgtgtgaggactccatcacggtgggcatggtgcggcaccagggaaagatcatgtacgtc 

1 1 1 1 1 1 1 1 1 1 1 1 f 1 1 1 1 1 1 r 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 i 1 1 1 1 1 1 1 e i r 

9tgtgtgaggactccatcacggtgggcatggtgcggcaccagggaaagatcatgtacgtc 2256 
ggggacgtccgcagcgtcacacagaagcatatccaggagtggggcccattcgacctggtg 

Ml M Ml IMIIIIII IMMIIMM IIIMIIMI IIMMIIIMII I IIMMII 

ggggacgtccgcagcgtcacacagaagcatatccaggagtggggcccattcgacctggtg 2316 
attggaggcagtccctgcaatgacctctccattgtcaaccctgcccgcaagggactttat 

i ] 1 1 1 1 1 1 r i e i f f 1 1 1 1 1 1 1 1 1 1 1 1 i i i 1 1 1 1 e 1 1 1 1 1 1 1 1 1 1 1 1 1 ; i f 1 1 1 1 1 1 1 1 1 j 

attggaggcagtccctgcaatgacctctccattgtcaaccctgcccgcaagggactttat 23 7 6 
9 a 999tactggccgcctcttctttgagttctaccgcctcctgcatgatgcgcggcccaag 

i 1 1 1 1 1 1 i 1 1 1 1 [ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 f 1 1 j 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 

9 a 999tactggccgcctcttctttgagttctaccgcctcctgcatgatgcgcggcccaag 24 3 6 
gagggagatgatcgccccttcttctggctctttgagaatgtggtggccatgggcgttagt 

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 

gagggagatgatcgccccttcttctggctctttgagaatgtggtggccatgggcgttagt 24 96 
gacaagagggacatctcgcgatttcttgagtctaaccccgtgatgattgacgccaaagaa 

llllllll 1 1 Ml lllllll II IIMMII II IIIIIMM II 1 1 MIMIIM MIIM 

gacaagagggacatctcgcgatttcttgagtctaaccccgtgatgattgacgccaaagaa 2 556 
gtgtctgctgcacacagggcccgttacttctggggtaaccttcctggcatgaacaggcct 

llllllll 1 1 IMIIIIII I IIIIMIIIIIIIIIIIIII II II IIIIIMM I II MM 

gtgtctgctgcacacagggcccgttacttctggggtaaccttcctggcatgaacaggcct 2 616 
ttggcatccactgtgaatgataagctggagctgcaagagtgtctggagcacggcagaata 

M it 1 1 1 1 1 ii i ii it i M i M M ii ii M M 1 1 1 ii i ii ii M ii i ii i ii ii 1. 1 1 ii i 

ttggcatccactgtgaatgataagctggagctgcaagagtgtctggagcacggcagaata 2 676 
gccaagttcagcaaagtgaggaccattaccaccaggtcaaactctataaagcagggcaaa 

lllllll IIIIIIIIIIMMIIIIIIIIMMIIIIIIMMMIMIIIIIMIIIII 

gccaagttcagcaaagtgaggaccattaccaccaggtcaaactctataaagcagggcaaa 2 73 6 
gacc'agcatttccccgtcttcatgaacgagaaggaggacatcctgtggtgcactgaaatg 

1 1 ) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 i 1 1 1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 M 1 1 1 1 1 1 1 1 

gaccagcatttccccgtcttcatgaacgagaaggaggacatcctgtggtgcactgaaatg 2 7 96 
gaaagggtgtttggcttccccgtccactacacagacgtctccaacatgagccgcttggcg

II I MINIM 1 1 lllllllllllllll II lllllll I III III 1 1 1 IIIMM I Mill 

gaaagggtgtttggcttccccgtccactacacagacgtctccaacatgagccgcttggcg 2 856 

aggcagagactgctgggccgatcgtggagcgtgccggtcatccgccacctcttcgctccg 

I I I I I I I I I I I I I I I I I I I III I I I III I I I I II I I I I II I I II I I I I I I I I I M I I I I I 
aggcagagactgctgggccgatcgtggagcgtgccggtcatccgccacctcttcgctccg 2 916 

ctgaaggaatattttgcttgtgtgtaagggacatgggggcaaactgaagtagtgatgata 

I I I I I I I I I I I I I I I II I I I I I I I I I I I I I I I I I I II I I I I I I I I I I I M I II II | | | | | 
ctgaaggaatattttgcttgtgtgtaagggacatgggggcaaactgaagtagtgatgata 2 97 6 

aaaaagttaaacaaacaaacaaacaccaagaacgagaggacggagaaaagttcagcaccc 

II MMMII 1 1 lllllll MM Mill lllllll I MM Mill lllllllllllllll 

aaaaagttaaacaaacaaacaaacaccaagaacgagaggacggagaaaagttcagcaccc 3 03 7 
agaagagaaaaaggaatttaaagcaaaccacagaggaggaaaacgccggagggcttggcc 

< 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 E 1 1 E 1 1 1 M I f 1 1 1 1 1 1 r ! 1 1 f 1 1 1 1 1 1 1 1 1 1 

agaagagaaaaaggaatttaaagcaaaccacagaggaggaaaacgccggagggcttggcc 3 0 98 
ttgcaaaagggttggacatcatctcctgagttttcaatgttaaccttcagtcctatctaa 

1 1 1 1 1 1 1 f E 1 1 1 i I e 1 1 1 1 i 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 E 1 1 1 1 1 1 1 1 1 E 1 1 1 1 1 1 1 1 1 1 1 

ttgcaaaagggttggacatcatctcctgagttttcaatgttaaccttcagtcctatctaa 315 8 
aaagcaaaataggcccctccccttcttcccctccggtcctaggaggcgaactttttgttt 

i M I I I I I I I M I I I I I I I I E I I I 1 1 I I I 1 1 1 I 1 1 M M I I I 1 1 M I I 1 1 I 1 1 I I I I I I I 

aaagcaaaataggcccctccccttcttcccctccggtcctaggaggcgaactttttgttt 3 218 
tctactctttttcagaggggttttctgtttgtttgggtttttgtttcttgctgtgactga 

I I lllllll 1 1 lllllllllllllll II 1 1 II 1 1 1 MMMII I II Mill 1 1 lllllll 

tctactctttttcagaggggttttctgtttgtttgggtttttgtttcttgctgtgactga 327 8 
aacaagagagttattgcagcaaaatcagtaacaacaaaaagtagaaatgccttggagagg 

1 1 1 1 1 1 1 1 1! 1 1 1 1 1 1 1 ! 1 1 j 1 1 1 1 1 1 M 1 1 1 1 1 1 M 1 1 1 1 1 1 1 1 1 1 1 1 1 f I M 1 1 1 1 1 1 

aacaagagagttattgcagcaaaatcagtaacaacaaaaagtagaaatgccttggagagg 3 33 8 
aaagggagagagggaaaattctataaaaacttaaaatattggttttttttttttttcctt 

1 1 1 1 1 1 r 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 E I E I E 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 E 1 1 1 1 1 E 1 1 1 E I 

aaagggagagagggaaaattctataaaaacttaaaatattggttttttttttttttcctt 3 3 98 
ttctatatatctctttggttgtctctagcctgatcagataggagcacaaacaggaagaga 

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Ii M 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ii 1 1 1 1 1 1 it 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 

ttctatatatctctttggttgtctctagcctgatcagataggagcacaaacaggaagaga 3 458 
atagagaccctcggaggcagagtctcctctcccaccccccgagcagtctcaacagcacca 

IIIIIIIIIIIIIIIIIIIIIIIMIIIIIIIIIIIIIIMIIIIMIIIIIIIIIIIII 

atagagaccctcggaggcagagtctcctctcccaccccccgagcagtctcaacagcacca 3 518 
ttcctggtcatgcaaaacagaacccaactagcagcagggcgctgagagaacaccacacca 

1 1 1 1 1 1 1 1 1 1 f i 1 1 1 1 1 1 1 1 1 1 1 1 1 1 j 1 1 E 1 1 1 j 1 1 1 E i f e 1 1 1 1 1 1 1 1 E 1 1 1 1 1 1 1 1 1 1 

ttcctggtcatgcaaaacagaacccaactagcagcagggcgctgagagaacaccacacca 3 578 
gacactttctacagtatttcaggtgcctaccacacaggaaaccttgaagaaaaccagttt 

I II I'M 1 1 1 1 1 1 II I II 1 1 1 II 1 1 1 1 1 1 i 1 1 II I II M 1 1 1 1 1 II I II 1 1 1 1 II 1 1 II 1 1 

gacactttctacagtatttcaggtgcctaccacacaggaaaccttgaagaaaaccagttt 3 63 8 
ctagaagccgctgttacctcttgtttacagtt 

II 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 M 1 1 1 1 1 1 1 

ctagaagccgctgttacctcttgtttacagtt 367 0 
Exhibit C

Alignment spanning the coding region of mouse Dnmt3b from ATCC Deposit No. 
209934 (top) and currently amended SEQ ID NO:2 (bottom) 2 

caggaaacaatgaagggagacagcagacatctgaatgaagaagagggtqccagcqqqtat 

II MIMIIIIIIIIIIIIIIIIIIIIIIIIIIIIllllllliiiiii 

caggaaacaatgaagggagacagcagacatctgaatgaagaagagggtgccagcgggtat 319 
gaggagtgcattatcgttaatgggaacttcagtgaccagtcctcagacacgaaqqatact 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

gaggagtgcattatcgttaatgggaacttcagtgaccagtcctcagacacgaaggatgct 379 
ccctcacccccagtcttggaggcaatctgcacagagccagtctgcacaccagagaccaga 

MINI MIIIIIIIIIIIIIIIIIIIIIMIIMIIIMIMIIIIIIIIIIII 

ccctcacccccagtcttggaggcaatctgcacagagccagtctgcacaccagagaccaga 439 
ggccgcaggtcaagctcccggctgtctaagagggaggtctccagccttctgaattacacq 

MMIIIIIIIIIIIII Mllllllllll IMIIIMIIIIIII I MINIMI I Mill 

ggccgcaggtcaagctcccggctgtctaagagggaggtctccagccttctgaattacacg 499 
caggacatgacaggagatggagacagagatgatgaagtagatgatgggaatggctctgat 

Ml III 1 1 Mill III I Mill 1 1 Mil || | mill Mill j | MINI 1 1| I nil i 

caggacatgacaggagatggagacagagatgatgaagtagatgatgggaatggctctgat 55 9 
attctaatgccaaagctcacccgtgagaccaaggacaccaggacgcgctctqaaaqcccq 

in iiiiiiiiii ii iiiiiiiiiiiiiiiiiiiiiiiiiiiii 

attctaatgccaaagctcacccgtgagaccaaggacaccaggacgcgctctgaaagcccg 619 
gctgtccgaacccgacatagcaatgggacctccagcttggagaggcaaagagcctccccc 

Mini iimmmim iiiiiiiiiiiiiiiiiiiiiii 

gctgtccgaacccgacatagcaatgggacctccagcttggagaggcaaagagcctccccc 67 9 
agaatcacccgaggtcggcagggccgccaccatgtgcaggagtaccctgtggagtttccg 

I IMMIIIIIIIIIIIIIIimilllllllllllllllllllllllllllMI 

agaatcacccgaggtcggcagggccgccaccatgtgcaggagtaccctgtggagtttccg 73 9 
gctaccaggtctcggagacgtcgagcatcgtcttcagcaagcacgccatggtcatcccct 

1 1 1 1 . J 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 j 1 1 1 1 j j j 1 1 1 1 1 f 1 1 1 1 1 1 1 1 f 1 1 f 1 1 1 ii 1 1 1 1 1 1 j 1 1 

gctaccaggtctcggagacgtcgagcatcgtcttcagcaagcacgccatggtcatcccct 799 
gccagcgtcgacttcatggaagaagtgacacctaagagcgtcagtaccccatcagttgac 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

gccagcgtcgacttcatggaagaagtgacacctaagagcgtcagtaccccatcagttgac 859 
ttgagccaggatggagatcaggagggtatggataccacacaggtggatgcagagaqcaqa 

iiiiiiiiimmiiimm m i m i iiiiiiiiiiii 

ttgagccaggatggagatcaggagggtatggataccacacaggtggatgcagagagcaga 919 
gatggagacagcacagagtatcaggatgataaagagtttggaataggtgacctcqtqtqq 

III Ml MM I MM MM MM Ill II II III M 1 1 1 1 1 1 1 1 1 1 

gatggagacagcacagagtatcaggatgataaagagtttggaataggtgacctcgtgtgg 97 9 



ggaaagatcaagggcttctcctggtggcctgccatggtggtgtcctggaaagccacctcc 

MIMIIIIIIIIIII MIMIIMIIMIIIIMIIIIIIIIII 

ggaaagatcaagggcttctcctggtggcctgccatggtggtgtcctggaaagccacctcc 1039 



2 Bolded nucleotides indicate nucleotides that were amended on July 23, 2001. 
^?????^ ggccatgccc ^aatgcgctgggtacagtggtttggtgatggcaagttttct

M 1 1 1 1 1 MINIUM IIIMIIIIIIIIIIIIIIIIMIIII 

aagcgacaggccatgcccggaatgcgctgggtacagtggtttggtgatggcaagttttct 1099 
gagatctctgctgacaaactggtggctctggggctgttcagccagcactttaatctgqct 

IMIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIMIIIIIlll Hiiiii 

gagatctctgctgacaaactggtggctctggggctgttcagccagcactttaatctggct 1159 

TTTTiTTrTMTTTTTTTTTTiTiTTTTTTTTTTTTiTTTTTTTTTTTTTTTTTTTTTTT 

accttcaataagctggtttcttataggaaggccatgtaccacactctggagaaagccagg 1219 

TTTTTT?TTTTTTTTT?TTTT???T?TTT???T??TTT?T?TTT?TTTTT??TTTT?TT? 

gttcgagctggcaagaccttctccagcagtcctggagagtcactggaggaccagctgaag 1279 
cccatgctggagtgggcccacggtggcttcaagcctactgggatcgagggcctcaaaccc 

MMMIIIIMIIIIMIIIIIMIIIIIIIIIIMIIIIIIIIIIIIIIIII 

cccatgc tggagtgggcccacggtggcttcaagcctactgggatcgagggcctcaaaccc 13 3 9 
aacaagaagcaaccagtggttaataagtcgaaggtgcgtcgttcagacagtaggaactta 

MMIMIIIIIIMIillMMIIIMIIIIIIIIII MINI MINI llllllllll 

aacaagaagcaaccagtggttaataagtcgaaggtgcgtcgttcagacagtaggaactta 13 99 
gaacccaggagacgcgagaacaaaagtcgaagacgcacaaccaatgactctgctgcttct 

MMMMIIMIIIIMIIIMMIMI IMIIIMIIIIMI IIIIIMIIIIIIIII 

gaacccaggagacgcgagaacaaaagtcgaagacgcacaaccaatgactctgctgcttct 1459 
gagtcccccccacccaagcgcctcaagacaaatagctatggcgggaaggaccqaqqqqaQ 

Mill I I II II I III III III II I llllllllll 

gagtcccccccacccaagcgcctcaagacaaatagctatggcgggaaggaccgaggggag 1519 
gatgaggagagccgagaacggatggcttctgaagtcaccaacaacaagggcaatctggaa 

iiii.iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

gatgaggagagccgagaacggatggcttctgaagtcaccaacaacaagggcaatctggaa 1579 
gaccgctgtttgtcctgtggaaagaagaaccctgtgtccttccaccccctctttgagggt 

IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIMIIIIIII 1 1 Mil ill 

gaccgctgtttgtcctgtggaaagaagaaccctgtgtccttccaccccctctttgagggt 163 9 
gggctctgtcagagttgccgggatcgcttcctagagctcttctacatgtatgatgaggac 

IIHIIIIIII Ml llllllllll I llllllllll INI II MINI I II Mill 

gggctctgtcagagttgccgggatcgcttcctagagctcttctacatgtatgatgaggac 1699 
ggctatcagtcctactgcaccgtgtgctgtgagggccgtgaactgctgctgtgcagtaac 

MM Mill I Ml MM Mill Mill! MINI I III MMIMM Mill Mill 

ggctatcagtcctactgcaccgtgtgctgtgagggccgtgaactgctgctgtgcagtaac 175 9 
acaagctgctgcagatgcttctgtgtggagtgtctggaggtgctggtgggcgcaggcaca 
acaagctgctgcagatgcttctgtgtggagtgtctggaggtgctggtgggcgcaggcaca 1819 
gctgaggatgccaagctgcaggaaccctggagctgctatatgtgcctccctcagcgctgc 

IIIIIIIIIIMIIIIIIIIIMIMIIIMIIIMIIIIIMIIIIIIIIIIIIIIIII 

gc tgaggatgccaagc tgcaggaaccctggagctgc tat atgtgcct ccc tcagcgctgc 18 7 9 
catggggtcctccgacgcaggaaagattggaacatgcgcctgcaagacttcttcactact 

NMIMMIMMIIIMMIIMMIIIMIIMIMIIMMIIMIIIIIIMIII 

catggggtcctccgacgcaggaaagattggaacatgcgcctgcaagacttcttcactact 193 9 
gatcctgacctggaagaatttgagccacccaagttgtacccagcaattcctgcagccaaa

imnmimmmmmiimmmmmmiimmmmimmmiiimiiimi 

gatcctgacctggaagaatttgagccacccaagttgtacccagcaattcctgcagccaaa 1999 
aggaggcccattagagtcctgtctctgtttgatggaattgcaacggggtacttggtgctc 

iiiiiiiiiiiiiiiriiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiniiiiiii 

aggaggcccattagagtcctgtctctgtttgatggaattgcaacggggtacttggtgctc 2 05 9 
aaggagttgggtattaaagtggaaaagtacattgcctccgaagtctgtgcagaqtccatc 

1 1 II III III Illllllll llllll MIMI IIIMIMM MM IIMIIIIIIIMI 

aaggagttgggtattaaagtggaaaagtacat-tgcctccgaagtctgtgcagagtccatc 2119 
gctgtgggaactgttaagcatgaaggccagatcaaatatgtcaatgacgtccggaaaatc 

1 1 1 i 1 1 M 1 1 1 1 1 ! I ! i M 1 1 1 i 1 1 1 i 1 1 1 M ! I ! M 1 1 1 1 1 1 1 1 1 1 1 1 1 1 M I 

gctgtgggaactgttaagcatgaaggccagatcaaatatgtcaatgacgtccggaaaatc 2179 
accaagaaaaatattgaagagtggggcccgttcgacttggtgattggtggaagcccatqc 

M MIMI M MIMIMIIII Mill I IIIIIIIIIMI IIIIMIIMIIIIIIIIII 

accaagaaaaatattgaagagtggggcccgttcgacttggtgattggtggaagcccatgc 223 9 
aatgatctctctaacgtcaatcctgcccgcaaaggtttatatgagggcacaggaaqqctc 

Ml I MIMI IIIIIIIIMIIIII MMMIIIIIIMI IMIIMMIMIIIIIII I 

aatgatctctctaacgtcaatcctgcccgcaaaggtttatatgagggcacaggaaggctc 2299 
ttcttcgagttttaccacttgctgaattatacccgccccaaggagggcgacaaccqtcca 

IIIIIIIIIIMIIIIIII MIMI lllllllllllllllllllllllllllllllllll 

ttcttcgagttttaccacttgctgaattatacccgccccaaggagggcgacaaccgtcca 2359 
ttcttctggatgttcgagaatgttgtggccatgaaagtgaatgacaagaaagacatctca 

MIIIIIIIIIIIIIIIIMIIIIIIIIMIIIIIIIIII MIIIIMMIIIIIMM 

ttcttctggatgttcgagaatgttgtggccatgaaagtgaatgacaagaaagacatctca 2419 
agattcctggcatgtaacccagtgatgatcgatgccatcaaggtgtctgctgctcacagg 

IMIIII MIIIMMIIIIIMIIMMIMMMIMMMIIIIIIIIIIIIIIMI 

agattcctggcatgtaacccagtgatgatcgatgccatcaaggtgtctgctgctcacagg 2479 
gcccggtacttctggggtaacctacccggaatgaacaggcccgtgatggcttcaaagaat 

M M 1 1 ! 1 1 If 1 1 1 E I j 1 1 1 1 1 1 M f 1 1 1 f I f 1 1 1 1 II I i M 1 1 i 1 1 1 1 1 M 1 1 1 

gcccggtacttctggggtaacctacccggaatgaacaggcccgtgatggcttcaaagaat 253 9 
gataagctcgagctgcaggactgcctggagttcagtaggacagcaaagttaaagaaagtg 

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 

gataagctcgagctgcaggactgcctggagttcagtaggacagcaaagttaaagaaagtg 2599 
cagacaataaccaccaagtcgaactccatcagacagggcaaaaaccagcttttccctgta 

MIIMMII MIMI III IMIIMII IMIIIIIMIIMIIIIIIIIIIIIIIIIII 

cagacaataaccaccaagtcgaactccatcagacagggcaaaaaccagcttttccctgta 2659 
gtcatgaatggcaaggacgacgttttgtggtgcactgagctcgaaaggatcttcggcttc 

Mill 1 1 MM 1 1 Mill Mill II I III III MM llllll II MM llllll 

gtcatgaatggcaaggacgacgttttgtggtgcactgagctcgaaaggatcttcggcttc 2719 
cctgctcactacacggacgtgtccaacatgggccgcggcgcccgtcagaagctgctgggc 

llirillMIIMIIIIIMIIIIIIIIIIMIIIIIIIIIIIIIIIIIIIIIIIIIIII 

cctgctcactacacggacgtgtccaacatgggccgcggcgcccgtcagaagctgctgggc 2779 
aggtcctggagtgtaccggtcatcagacacctgtttgcccccttgaaggactactttgcc 

IMIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIMIIIIIIIII 

aggtcctggagtgtaccggtcatcagacacctgtttgcccccttgaaggactactttgcc 2839 
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tgtgaatagttcta^ 

tgtgaatagttctacccaggactggggagctctcggtcagagccagtgcccagagtc 2896 
Estimation of Errors in "Raw" DNA Sequences: A Validation Study



As DNA sequencing is performed more and more in a mass-production-like manner, efficient quality control
measures become increasingly important for process control, but so also does the ability to compare different
methods and projects. One of the fundamental quality measures in sequencing projects is the position-specific
error probability at all bases in each individual sequence. Accurate prediction of base-specific error rates from
"raw" sequence data would allow immediate quality control as well as benchmarking different methods and
projects while avoiding the inefficiencies and time delays associated with resequencing and assessments after
"finishing" a sequence. The program PHRED provides base-specific quality scores that are logarithmically
related to error probabilities. This study assessed the accuracy of PHRED's error-rate prediction by analyzing
sequencing projects from six different large-scale sequencing laboratories. All projects used four-color
fluorescent sequencing, but the sequencing methods used varied widely between the different projects. The
results indicate that the error-rate predictions such as those given by PHRED can be highly accurate for a large
variety of different sequencing methods as well as over a wide range of sequence quality.



In DNA sequencing, knowledge about the accuracy of sequences can be very valuable. For example,
different large-scale sequencing projects may produce sequences at similar rates and costs but with
significantly different error rates in the final sequence. One major determinant in the final error
rate is the accuracy of the "raw" sequence. Knowledge about the frequency and location of errors in
the raw sequence data can help to direct "polishing" efforts to the places where additional effort is
needed; it also enables the comparison between different sequencing projects without requiring that
the same region be sequenced in each project.

Another area where estimates about sequence error rates would be beneficial is technology
development. Accurate error estimates at each base would enable "quality benchmarking" between
different methods, thus enabling researchers to choose the method that fills their needs for accuracy
and throughput best.

Several groups have developed mathematical models to predict the error probability at any given
position in raw sequences. Lawrence and Solovyev used linear discriminant analysis to calculate
separate probability estimates for insertions, deletions, and mismatches (Lawrence and Solovyev 1994).
Ewing and Green (1998) developed the program



1 E-MAIL peter.richterich@genomecorp.com; FAX (781) 691-9535.



PHRED, which calculates a quality score at each base. This quality score q is logarithmically linked
to the error probability p: $q = -10 \times \log_{10}(p)$ (for a discussion of how quality scores are
calculated and what the limitations are, see Ewing et al. 1998). When used in combination with
sequence assembly and finishing programs that utilize these error estimates, reliable error
probabilities promise to increase the accuracy of consensus sequences and to reduce the efforts
required in the finishing phase of sequencing projects (Churchill and Waterman 1992; Bonfield and
Staden 1995).
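
The logarithmic relation between quality score and error probability can be illustrated with a short sketch. The function names are illustrative assumptions and are not part of PHRED itself; only the formula q = -10 × log10(p) is taken from the text.

```python
import math

def error_probability(q):
    """Error probability p for a quality score q, from q = -10 * log10(p)."""
    return 10 ** (-q / 10)

def quality_score(p):
    """Quality score q for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# Print the error probability for a few representative quality scores.
for q in (4, 10, 20, 30, 40):
    print(q, error_probability(q))
```

A score of 20 therefore corresponds to one expected error in 100 bases, and a score of 30 to one in 1,000.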

To examine the accuracy of probability estimates made by the program PHRED, we compared the actual
and predicted error rates for six different cosmid- or BAC-sized projects that were produced by six
different large-scale sequencing centers in the United States. All of these six projects used
four-color fluorescent sequencing machines; however, the DNA preparation methods, sequencing enzymes,
fluorescent dyes and chemistries, and gel lengths varied significantly between the six groups.
Table 1 gives an overview of the sequencing projects analyzed. Table 2 lists the different methods used.

RESULTS 

Error Rate Prediction Accuracy for Six Projects 

A comparison of actual and predicted error rates for 
the six projects in this study is shown in Table 3. 
Table 1. Summary of Data Sets

Project     Reads   Aligned bases   Average aligned read length
A             455         416,214                           915
B            1277         871,230                           682
C            1065         603,655                           567
D             834         414,595                           497
E            1638       1,149,209                           702
F            1885         907,796                           482
Total        7154       4,362,699                           610



The results indicate that PHRED is very successful in identifying bases with low error probabilities.
For example, the 1.28 million bases with quality scores of 4-12 (corresponding to error probabilities
between 39.8% and 6.3%) contain a total of 187,926 errors. In contrast, the 1.44 million bases with
quality scores between 33 and 42 (corresponding to error probabilities between 0.05% and 0.006%)
contain only 237 errors, which translates into a 790-fold lower error rate. The trend toward lower
error rates can also be observed for each individual project. In most cases, the actual number of
errors is close to the predicted number. It is also apparent that the actual error rate is typically
lower than the predicted error rate.

Both the high overall accuracy and the tendency to slightly overpredict errors are confirmed by
statistical analysis, as shown in Table 4. The correlation between predicted and actual error
frequencies is excellent for all projects (Spearman correlation coefficient >0.89, P < 0.0001).
Averaged over all projects, the actual error rate is 84.5% of the predicted error rate; the slope of
the relation between predicted and actual error rates differs slightly between projects. To put these
differences between projects in relation, it is worthwhile remembering that PHRED quality scores
cover a wide dynamic range: The maximum quality score of 51 corresponds to a 50,000-fold lower
predicted error rate than the minimum quality score of 4. Even the relative difference between
successive quality scores is larger than the relative difference in the slopes; for example, a
quality score of 10 corresponds to an error probability of 10%, whereas a score of 9 corresponds to
an error probability of 12.6%.
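
The dynamic range quoted above follows directly from the score definition. The sketch below is an illustration only; the helper name `fold_difference` is an assumption made for this example.

```python
def fold_difference(q_low, q_high):
    """Ratio of predicted error probabilities between two quality scores."""
    return 10 ** ((q_high - q_low) / 10)

# Ratio between the minimum (4) and maximum (51) quality scores.
print(round(fold_difference(4, 51)))

# Ratio between two successive quality scores, e.g. 9 and 10.
print(round(fold_difference(9, 10), 3))
```

The ratio between successive scores (about 1.26) is larger than the ratio between the largest and smallest slopes listed in Table 4, which is the comparison the paragraph draws.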

A different way of looking at the relation be- 
tween the actual and predicted error rates is shown 



in Figure 1. Here, the error rate as a function of the position within all reads in each of the
projects, averaged over 50-base windows, is depicted. For all six projects, the predicted error rates
are very close to the actual error rates over the entire length of the sequences. Each project shows
a characteristic distribution of error rates, which differs from each of the other projects. The
minimum error rate differs dramatically between projects. The best projects achieve raw error rates
of 0.23%-0.36% in the best region of the sequence read, typically from base 150 to 200. The worst
project in the data set had an ~10-fold higher error rate of 2.58%.

Toward the end of sequence reads, the error rates increase and start to exceed 10% between bases
300 and 700. In projects that used mainly short gels (e.g., projects D and F), this increase begins
sooner, whereas projects that use longer gels show a markedly longer stretch of low error rates
(e.g., projects A and B).

Table 5 summarizes key results for the six projects. The first four projects have similar minimum
and average error rates. However, the length of the region where the error rate is below 5% differs
significantly, from 403 to 682 bases. The project with the shorter low error rate regions contained
larger portions of reads generated on short gels, whereas projects A and B were run exclusively on
long gels (ABI373 stretch or ABI377 sequencers). Other factors contributing to differences between
the first four projects were differences in sequencing chemistries, production scale, and
electrophoresis conditions and machines.

Project E and, in particular, project F, had significantly higher error rates than the first four
projects. In projects E and F, every sequence generated for the project had been included in the data
set, whereas the other four projects had eliminated some "bad" sequences through manual or auto-


Table 2. Overview of Sequencing Methods Used in the Different Projects

Template DNA             single-stranded M13, double-stranded plasmids
Sequencing enzymes       Sequenase, Taq, KlenTaqTR, AmpliTaq FS
Sequencing chemistries   Dye primer (two different dye chemistries), dye terminator
Sequencing machines      ABI 373, ABI 373 stretch, ABI 377
Gel length               Only short gels, only long gels, mixes of short and long gels
Table 3. Comparison of Predicted and Actual Error Rates for Six Different Sequencing Projects

                                              Quality score
Project                         4-12      13-22      23-32      33-42      43-51
A        aligned bases       119,246     75,293     76,391    144,876     73,234
         expected errors      20,256      2,064        172         37          1
         actual errors        16,784      1,758        127         17          1
B        aligned bases       182,034    137,940    181,998    399,690    140,176
         expected errors      29,953      3,704          ?        102          3
         actual errors        26,038      2,536          ?         35          0
C        aligned bases       139,345    131,419    151,197    292,070     68,529
         expected errors      22,277      3,411        357         74          2
         actual errors        16,670      1,513        194         26          3
D        aligned bases       103,898     68,995     68,613    153,730    111,752
         expected errors      16,880      1,919        168         38          3
         actual errors        14,495      1,924        146         59          2
E        aligned bases       378,755    217,438    167,968    392,717    144,313
         expected errors      63,947      6,336        418         95          4
         actual errors        55,968      6,516        355         67          5
F        aligned bases       359,809    136,688     98,840     64,035      5,130
         expected errors      66,938      4,079        256         23          0
         actual errors        57,971      3,856        332         33          1
All      aligned bases     1,283,087    767,771    739,007  1,447,118    543,134
         expected errors           ?     21,013      1,781        370         13
         actual errors       187,926     18,103      1,441        237         12



matic inspection. After removal of <10% of the worst sequences in project E, the error rates for
the remaining sequences were comparable to those of the first four projects. In contrast, project F
showed a much more uniform distribution of sequence quality.



The last column in Table 5 shows the average 
number of bases with an estimated error probability 
of at most 0.1%, which is equivalent to a quality
score of at least 30. The count of such "very high- 
quality" bases is a good indicator of sequence qual- 
ity, both for individual sequences and, when aver- 



Table 4. Summary of Statistical Analysis Results

Project   Spearman ρ   P > |ρ|    Slope   t ratio   P > |t|
A             0.9646   <0.0001    0.818      75.1   <0.0001
B             0.9890   <0.0001    0.874      98.2   <0.0001
C             0.9846   <0.0001    0.766      71.6   <0.0001
D*            0.8692   <0.0001    0.855      68.3   <0.0001
E             0.9956   <0.0001    0.884     144.3   <0.0001
F             0.9968   <0.0001    0.865     151.6   <0.0001
All           0.9964   <0.0001    0.845     174.5   <0.0001

*In project D, the Spearman correlation coefficient ρ was artificially low as only very few bases (10) had
a quality score of 5, and none of these bases contained an actual error (expected: 3.16 errors). Exclusion of
this quality score gave a Spearman correlation coefficient of 0.9786 (P < 0.0001). The frequencies in the slope
calculations were weighted by the number of bases at any given quality score and, thus, were not sensitive to
such small samples.



GENOME RESEARCH



ity analysis and control in large-scale DNA sequencing projects. To analyze how accurate PHRED error
estimates are for different quality sequences within the same sequencing project, we subdivided a
data set into four quartiles, based on the number of very



is 

shown in Figure 2.

When measured by the error rate in the best region of a sequence, the data quality in the different
quartiles varies >100-fold between the best and the worst 25% of the sequences. The best quartile
showed ~0.03% error for >100 bases, whereas the error rate in the worst quartile always exceeded 5%.
In quartiles 2 and 3, the predicted error rates match the actual error rates very closely. In the
best and worst quartiles, PHRED's accuracy was somewhat lower from base 100 to 500. In the best
sequences, PHRED's error estimates were about twofold too high; in the worst sequences, the error
estimates were too low, again by a factor of 2. This underprediction of errors can be partially
explained by the fact that PHRED gives ambiguous base calls a quality score of 4, corresponding to an
error probability of 39.8%; however, N's will always show up as an actual error. Even in the worst
and best quartiles, however, the predicted error rate curves are very similar to the actual error
rate curves.

The results shown in Figure 2 also demonstrate that the count of very high-quality bases, or bases
Figure 2 Actual and predicted error rates.

Sequence reads were sorted by the number of bases with a predicted error rate
of at most 0.1% (very high-quality bases), and assigned to quartiles, with quartile
1 corresponding to the highest numbers. Actual and predicted error rates for all
sequences in each subset were calculated as in Fig. 1. Note that a number of
sequence reads that had been rejected because of too low quality were added
back to the data set for illustrative purposes, all of which are in quartile 4. These
sequences were not included in the data sets used to generate Figs. 1 and 3 and
Tables 1 and 1



can be used effectively to characterize the overall quality of a sequence read. Sorting the sequence
reads into quartiles based on the number of very high-quality bases worked well, as shown by the
>100-fold difference in the minimum error rate between the first and the fourth quartile.

Other methods to characterize the overall quality of individual reads based on PHRED quality scores
can give similar results. For example, counting bases above a minimum quality threshold anywhere in
the range of 20-40 gave similar results for most data sets (not shown), and such counts are used by a
number of different laboratories as quality measures. Alternatively, the quality values can be
converted to error probabilities and averaged to give the predicted error rate for the trace, or
summed to give the total predicted number of errors in a trace. However, such averages and totals
can sometimes give a misleading picture, as the following example illustrates. Assume that two
sequence reads have very similar quality in the alignable part of the read but that one of the two
sequences was run much longer and
Figure 3 Actual frameshift and total error rates for projects A and B. To calculate
frameshift error rates, only insertions and deletions were counted. Mismatch
errors, which account for the vast majority of errors after base 150, were included
only in the total error count. Note that project B (▲,△) has a similar or
slightly higher total error rate compared to project A (●,○) but only about
one-third as many insertions and deletions up to base 500. For both projects the
frameshift error rate in the raw data is <1 in 1000 for >300 bases, and <1 in
10,000 for >



sionally lead to questionable conclusions, as the results shown in Figure 3 illustrate.

Figure 3 shows the total actual error rates and the frameshift error rates for two projects, A and
B. The total error rates for both projects are similar for up to 350 bases; after 350 bases, project B
has a somewhat higher total error rate. The frameshift error rate gives rise to a different picture:
from base 1 to 500, project A has approximately four times as many insertions and deletions as
project B. This difference in frameshift error rates can be explained by the sequencing chemistries
that were used in the two projects. Project B, with the lower frameshift error rate, used only dye
terminator chemistry, which is known to eliminate band spacing artifacts from



therefore contains a longer unalignable "tail" of very low-quality bases. When calculating the
average error rate for these two sequences, the second sequence will have a much higher average
error and, therefore, appear to be of lower quality. In contrast, the counts of very high-quality
bases for both sequences will be very similar, as the unalignable tails contain few, if any,
high-quality bases. Therefore, counts of bases above a high enough quality threshold will give a
more robust and clearer picture of trace quality.
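
The two-read example can be made concrete with a small sketch. The read lengths and quality values below are hypothetical and chosen only for illustration; the threshold of 30 follows the definition of very high-quality bases used in the text.

```python
def error_probability(q):
    """Error probability for a quality score q (q = -10 * log10(p))."""
    return 10 ** (-q / 10)

def mean_predicted_error(quals):
    """Average predicted error rate over all bases of a read."""
    return sum(error_probability(q) for q in quals) / len(quals)

def vhq_count(quals, threshold=30):
    """Number of bases with a quality score of at least `threshold`."""
    return sum(1 for q in quals if q >= threshold)

# Hypothetical reads: identical 500-base high-quality region, different tails.
good_region = [40] * 500
read_short_run = good_region + [4] * 50    # short low-quality tail
read_long_run = good_region + [4] * 400    # same read, run much longer

for name, read in (("short run", read_short_run), ("long run", read_long_run)):
    print(name, vhq_count(read), round(mean_predicted_error(read), 4))
```

Both reads have the same count of very high-quality bases, while the average predicted error of the longer run is several times higher solely because of its tail.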



Frameshift Error Rates for Different Sequencing
Chemistries

Depending on how biologists use DNA sequences, knowledge about total error rates in raw sequences
may or may not be sufficient. For example, frameshift errors in coding sequences will generally lead
to incorrectly predicted open reading frames, whereas mismatch errors will do so only if the
mismatch introduces a stop codon or a new splice site. At the time of this writing, PHRED did not
differentiate between mismatch and frameshift errors, but only estimated total error rates. This
might occa-


secondary structures ("compressions"). Project A, on the other hand, used dye primer chemistry,
which is more prone to insertion and deletion errors from mobility artifacts, for most sequencing
reactions.

DISCUSSION 

As large-scale DNA sequencing has become a more routine and common process, the traditional methods
for assessing sequence quality have become unsatisfactory. In projects like single-pass cDNA
sequencing, it is not possible to calculate and compare error rates after finishing a sequence, as
finishing never takes place. Even when a comparison between raw and finished sequence can be done,
the time delay between raw data generation and quality assessment is often large. This delay makes
it difficult to improve ongoing projects, and it sometimes makes it impossible to capture problems
early on. Some immediate quality feedback can be reached by including known standard sequences for
quality control. However, this approach can be costly, and it fails when error profiles differ
between standard and unknown sequences.

In contrast to these traditional methods to assess sequence accuracy, direct estimation of error
rates in raw sequence data would enable immediate quality control and feedback. Accurate,
base-by-base estimates of error probabilities could also increase the utility of single-pass
sequences significantly, allow efficient comparison and optimization of different sequence
chemistries, and enable the development of better software tools for sequence assembly and analysis.

The critical question for any error rate prediction tool is how accurate are the error rate
estimates, in particular if different sequencing methods and chemistries are used? The results
presented herein provide an answer to this question for the program

would be useful. As shown in Tables 3 and 4 and in Figure 1, the agreement between predicted and
actual error rates was very good in each of the six different projects analyzed. The observed high
level of prediction accuracy in all of these projects is almost astonishing if one takes into
account that actual errors are binary (a base is either correct or wrong), whereas predicted error
rates are probabilities on a scale from 0.0 to 1.0. The observed tendency to overpredict error rates
can be at least partially explained by the "small sample correction" that was used in the derivation
of threshold parameters for quality scores (Ewing and Green 1998). For most practical applications,
such a somewhat conservative estimation of quality scores is tolerable or even desirable. Overall,
the results clearly show that error probabilities given by PHRED describe raw sequence data quality.
In judging the usefulness of predicted error probabilities, it is important to know how differences
in sequencing methods will influence the prediction accuracy. For example, the variation in peak
heights tends to be larger in dye terminator sequencing than in dye primer sequencing, and different
sequencing enzymes are known to produce different specific height variation patterns. Any estimation
of error probabilities that takes the peculiarities of a specific sequencing chemistry into account
would therefore be expected to be less accurate for different chemistries.

The projects included in this study were specifically chosen to provide an initial answer to the
question of how generally useful PHRED quality scores are. These projects represent the vast
majority of different multicolor fluorescent sequencing methods used in the last 3 years, different
template DNAs and DNA preparation methods, different enzymes, gel lengths, run conditions, and
different fluorescent dyes. The data also include a considerable spread in data quality, both
between projects



and within individual projects. None of the projects analyzed here were included in PHRED's training
set, and just one of the six laboratories that contributed data to this study also contributed data
to the training data sets. One of the projects in this study consisted entirely of dye terminator
sequences, which represented only a small fraction of the sequences in the test data set. Another
project exclu-

those used

from the other projects in this study in at least one, and typically many, experimental aspects like
template preparation, sequencing enzymes, gel run conditions, and so forth. Despite these
differences, the accuracy of error rate predictions was very similar for all projects.

Our results justify some optimism about the accuracy of PHRED quality scores for minor changes in
sequencing technology, for example, sequences generated by new enzymes and fluorescent dyes. Initial
studies showed that PHRED quality scores were also accurate for sequences produced by multiplex
sequencing with radioactive detection (P. Richterich, unpubl.). However, we also observed two
effects that can invalidate PHRED quality scores during these studies. First, sequences generated by
chemical sequencing gave too low quality scores at mixed (A + G) reactions. Because secondary peak
height is one of the parameters used in the error rate predictions, this is not surprising. Another
potential source of error is high-frequency noise in the trace data. With such data, PHRED
occasionally underestimated the band spacing by a factor of 2 or more, which resulted in incorrect
base calls and quality scores. By applying simple smoothing algorithms to data with high-frequency
noise, these problems could typically be resolved. Similar steps may be necessary to obtain accurate
PHRED quality scores on data that have been generated by different sequencing instruments or
preprocessed by different software.

Accurate quality scores can have a major impact on how sequences are used downstream from the
sequence production process. In traditional sequencing projects where the goal is complete coverage
at a final error rate below (e.g.) 1 in 10,000, the accuracy goals can be reached with single
sequence reads as long as the quality scores are at least 40 (however, other potential problems like
clone instability may make higher coverage advisable). Interesting questions arise as to how
individual read quality contributes to project quality, or the error rate of the "final" sequence.
Under the assumption that errors between different sequence reads are





RICHTERICH

completely independent, one could argue that two reads with a quality score of 20 (error probability
of 1 in 100) are just as valuable as one sequence with a quality score of 40 (error probability of
1 in 10,000). However, although a single sequence stretch with quality levels above 40 would give a
final sequence with an error rate of <1 in 10,000, assembling a consensus from two sequences with
quality scores of 20 (1% error rate) could lead to one of two results: If the errors were completely
random, the consensus sequence would be ambiguous at 2% of all locations; if the errors were
completely localized, for example, because of reproducible compressions, the consensus sequence

every 100 bases. Typically, consensus sequences derived from low-quality sequences will have both
kinds of problematic regions. Increased coverage can rapidly eliminate the random errors; however,
increased coverage does not resolve errors from systematic sources. Manual examination of such
problem areas

ing," however, tends to be time consuming, requires highly trained personnel, is an obstacle toward
complete automation of DNA sequencing, and sometimes fails to eliminate all errors. This leads to
the somewhat counterintuitive conclusion

ity can be even higher than indicated by the quality scores: One sequence of average quality above
40 can be "worth" more than two sequences of average quality 20.
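
The arithmetic behind the random-error case can be sketched as follows. This is an illustration under the stated independence assumption, with the simplification that two simultaneous miscalls are counted as agreeing; the function names are assumptions made for this example.

```python
def error_probability(q):
    """Error probability for a quality score q (q = -10 * log10(p))."""
    return 10 ** (-q / 10)

def two_read_outcomes(q):
    """Per-base outcomes for two reads of equal quality q with independent errors."""
    p = error_probability(q)
    disagree = 2 * p * (1 - p)   # exactly one read is wrong: ambiguous position
    both_wrong = p * p           # both reads wrong at the same position
    return disagree, both_wrong

disagree, both_wrong = two_read_outcomes(20)
print(f"ambiguous positions: {disagree:.4f}")
print(f"both reads wrong:    {both_wrong:.6f}")
print(f"single Q40 read:     {error_probability(40):.6f}")
```

With independent errors the two reads disagree at roughly 2% of positions, while the chance that both are wrong equals the error probability of a single read with a quality score of 40. The equivalence breaks down once errors are localized, which is the point made in the text.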

Another application of DNA sequencing where high quality can be of disproportionately high value is
the search for mutations in genomic DNA. In low-quality sequences, secondary peaks and low
resolution often complicate the identification of heterozygous mutations. In regions of higher
sequence quality, such secondary peaks are smaller or absent

positive errors can be significantly reduced in high-quality regions. Tools like PHRED, which can
accurately measure sequence quality from trace data, can be of twofold value for mutation detection.
First, base-specific quality scores can allow optimization of sequencing methods and strategies for
mutation detection. Second, the quality scores can be used to evaluate the usefulness of individual
sequence reads for mutation detection (e.g., by discarding reads below minimum thresholds), and they
can guide software that automatically detects mutations.

The ability to predict error rates in a highly accurate fashion is likely to have a major impact in
applications like those described above. PHRED is the first widely used program that accurately
predicts base-specific error probabilities. However, the algorithm for determining quality values
has been described (Ewing and Green 1998), and it should be straightforward to implement similar
quality values in other programs. Furthermore, an extension of the approach developed by Ewing and
Green should be possible. For example, differentiation between mismatch and frameshift errors would
enable better comparison of projects with similar total error rates but different frameshift error
rates. Several groups have described efforts to calculate separate probabilities (or "confidence
assessments") for mismatch errors and frameshift errors (Lawrence and Solovyev 1994; Berno 1996).
Their results demonstrated that different approaches to error type characterization are feasible and
promising. Implementation of such error type predictions in other programs similar to the way PHRED
uses quality scores would enable better method assessments, benchmarking, and production quality
control, and could have a significant impact on downstream uses of DNA sequence information.

METHODS 



For one project, sequence raw data in the form of ABI trace files were downloaded from a public FTP
site. Sequence data for the five other projects were kindly provided by five different large-scale
sequencing groups. Table 1 gives a summary of the six projects, and Table 2 gives an overview of the
different sequencing methods used in the projects. The projects differed in the amount of
prescreening of data that had been done, reflecting different approaches to quality control in
different laboratories. In two projects (B and C), different software programs had been used to
identify and eliminate low-quality sequences. One project (F) included all data files generated,
whereas the other three projects had excluded "failed lanes."

Comparison of Actual and Predicted Error Rates

The sequences for all traces in each project were recalled using the program PHRED (v. 961028).
Next, sequences in each project were assembled with PHRAP (P. Green, unpubl.). Slightly different
methods were chosen for the statistical and graphical evaluation of the error rate prediction
accuracy. In the statistical evaluation, only the longest contig produced by PHRAP was considered.
The tables of aligned bases and observed discrepancy counts for each quality score were taken from
the PHRAP output and analyzed as follows. The expected number of discrepancies (E) at each quality
score (q) was calculated by multiplying the number of aligned bases (N) with the error probability
corresponding to the quality score: $E = N \times 10^{-q/10}$. The Spearman ranking coefficients
were calculated by comparing the expected and observed error frequencies. To obtain the quantitative
relation between the expected and observed error rates over the entire range, a least-squares fit
between the observed and expected rates was performed, with the intercept set to zero and the number
of aligned bases at each quality score used as weights.
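
The expected-error calculation and the weighted zero-intercept fit can be sketched as below. This is a minimal illustration, not the analysis code used in the study: the function names and the example counts are hypothetical, and the Spearman ranking step is omitted.

```python
def expected_errors(n_bases, q):
    """Expected discrepancies E = N * 10^(-q/10) for N aligned bases at score q."""
    return n_bases * 10 ** (-q / 10)

def weighted_slope_through_origin(expected_rates, observed_rates, weights):
    """Weighted least-squares slope of observed vs. expected rates, intercept fixed at zero."""
    num = sum(w * x * y for x, y, w in zip(expected_rates, observed_rates, weights))
    den = sum(w * x * x for x, w in zip(expected_rates, weights))
    return num / den

# Hypothetical counts per quality score: (score, aligned bases, observed errors).
counts = [(10, 1000, 80), (20, 5000, 40), (30, 20000, 16)]
expected = [10 ** (-q / 10) for q, _, _ in counts]
observed = [errors / n for _, n, errors in counts]
weights = [n for _, n, _ in counts]

slope = weighted_slope_through_origin(expected, observed, weights)
print(round(slope, 3))
```

A slope below 1 means that fewer errors were observed than predicted, which is how the slope values in Table 4 are read.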

For a graphical comparison of estimated and actual error rates in 50-bp windows, the following steps
were taken. For two of the projects, the consensus sequence was retrieved from public databases. For
the four other projects, the DNA sequence and quality information were used by the program PHRAP to
assemble consensus sequences for each of the projects. The individual reads were aligned to the
consensus sequences of the longest contig, using the program CROSS_MATCH (P. Green, unpubl.), after
removing single-coverage regions from the ends of the consensus sequence. CROSS_MATCH uses an
implementation of the Smith-Waterman algorithm to generate alignments that typically do not include
the ends of sequences, where disagreements are commonly due to vector sequence or low-quality
sequence.

The quality files generated by PHRED and the
alignment summaries generated by CROSS_MATCH
were then analyzed as follows. First, the
region of each query sequence that had been aligned
by CROSS_MATCH was determined. Next, the actual
and predicted error rates for the entire aligned part of
each individual sequence were calculated. In addition,
the average actual and predicted error rates for
all alignable sequences together were calculated for
windows of 50 bases in length. To calculate the predicted
error rate, the quality scores determined by
PHRED at each base were converted to error probabilities
as described above (Ewing and Green 1998).
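
The windowed comparison can be illustrated with a short Python sketch. It is a simplification under stated assumptions: it handles a single aligned read rather than the average over all alignable reads, the function name and input layout are hypothetical, and the mismatch flags are assumed to have been extracted already from an alignment to the consensus.

```python
def windowed_error_rates(quality_scores, is_error, window=50):
    """Return (start, predicted_rate, actual_rate) for each window.

    quality_scores: PHRED quality score for each aligned base.
    is_error: True where the base disagrees with the consensus.
    """
    if len(quality_scores) != len(is_error):
        raise ValueError("inputs must have the same length")
    results = []
    for start in range(0, len(quality_scores), window):
        qs = quality_scores[start:start + window]
        errs = is_error[start:start + window]
        # Predicted rate: mean error probability, 10^(-q/10), over the window.
        predicted = sum(10 ** (-q / 10) for q in qs) / len(qs)
        # Actual rate: fraction of bases that disagree with the consensus.
        actual = sum(1 for e in errs if e) / len(errs)
        results.append((start, predicted, actual))
    return results

# Hypothetical read: 100 bases, all at quality 20, with two discrepancies.
quals = [20] * 100
errors = [False] * 100
errors[0] = True
errors[75] = True
rows = windowed_error_rates(quals, errors)
```

Each tuple pairs a window start position with the predicted and actual error rate for that window, which is the form needed to plot the two curves against each other.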

Subdividing Data Into Subsets Based on Data Quality

To examine the accuracy of PHRED quality scores
for data subsets of different quality within a project,
the following approach was taken. For all sequence
reads in project B, the number of bases with a quality
score of at least 30 in each sequence was determined
(bases with quality scores of at least 30 were
called very high-quality bases, or VHQ bases).
Sequences were sorted in descending order based on
the number of very high-quality bases, and divided
into four quartiles. Accordingly, [...] contained
the [...] of sequences with the highest number
of VHQ bases, and [...]
the "worst" sequences. To illustrate the prediction
accuracy in data with relatively high error rates,
sequences from project B that had been "discarded"
because they had not met the minimum quality criteria
were added back to the data set. The sequences
in each quartile were compared [...]
sequences to generate [...]
set, as described above for the graphical comparison.

Determining Actual Frameshift Error Rates

The calculation of actual frameshift error rates in
the raw sequence data was performed using
CROSS_MATCH, similar to the procedure described above
for total error rates, except that only insertion and
deletion errors were counted. Because PHRED does
not give separate frameshift error estimates, a comparison
of predicted and actual frameshift errors is
not possible.

I thank the participating laboratories for contributing their
data, Dr. Josee Dupuis for help with the statistical analysis,
and Dr. Phil Green for helpful discussions.

The publication costs of this article were defrayed in part
by payment of page charges. This article must therefore be
hereby marked "advertisement" in accordance with 18 USC
section 1734 solely to indicate this fact.

date of July 10, 1998 was prior to the filing date of the second provisional application, Appl.
No. 60/093,993, filed July 24, 1998, the benefit of which is [...]. [...]
informed that the '993 application includes the sequence information [...] to
deposit of the sequenced material on page 16, lines 1-2, of the specification.

[...] November 4 [...] had withdrawn [...] of the human

DNMT3A cDNA clone contained within ATCC Deposit No. 98809. At the [...]

spans the coding regions of the sequenced human DNMT3A cDNA clone contained in
ATCC Deposit No. 98809 and currently amended SEQ ID NO:3 is shown in Fig. 1 of
Exhibit B.

5. The amendment to the sequence listing, which was filed on July 23, 2001,
corrected six nucleotide errors in the coding sequence of SEQ ID NO:3 (see bolded
nucleotides at nucleotide positions 940, 1476, 1479, 1570, 2024 and 2119 of amended SEQ
ID NO:3 in Fig. 1 of Exhibit B). The amendment also deleted original nucleotides 1-123 of
SEQ ID NO:3, which does not include any DNMT3A coding sequence.

6. The deposited clone [...] and 5, above (i.e., ATCC Deposit No.
98809) is the same as the deposited clone recited in the above-captioned application. The
six nucleotides in the coding region of SEQ ID NO:3 that were corrected by the amendment
of July 23, 2001 correspond to the sequence contained in ATCC Deposit No. 98809. It is
well known that sequencing errors are a common problem in molecular biology. Peter
Richterich, Estimation of Errors in 'Raw' DNA Sequences: A Validation Study, 8 Genome
Research 251-59 (1998) (Exhibit C). I believe that one skilled in the art would have
sequenced the deposited material and recognized the sequencing errors.

7. My sequencing of ATCC Deposit No. 98809 also revealed that nucleotides
539-584 within the coding region of amended SEQ ID NO:3 are deleted in the deposited
cDNA. The deletion causes a frame shift in the coding region of the deposited cDNA and
predicts a truncated protein of 145 amino acids. An amino acid alignment of the predicted
amino acid sequence encoded by the human DNMT3A cDNA clone contained in ATCC
Deposit No. 98809 and the predicted amino acid sequence encoded by amended SEQ ID
NO:3 is shown in Fig. 2 of Exhibit B. The bolded sequence in Fig. 2 corresponds to the
predicted encoded amino acid sequence downstream of the nucleotide deletion in ATCC
Deposit No. 98809 and represents a point of divergence compared with the predicted amino
acids encoded by currently amended SEQ ID NO:3.

8. Currently amended SEQ ID NO:3 and SEQ ID NO:3 as originally filed in
U.S. Appl. Nos. 60/090,906 and 60/093,993, to which priority is claimed, do not harbor the
deletion, and encode a protein having 912 amino acids that is homologous to mouse
Dnmt3a.

9. Like DNA sequence errors, it is known that errors in DNA cloning may occur
in molecular biology. Deletion errors may occur and may be caused by, inter alia,
inadvertent digestion of DNA by restriction endonucleases or exonucleases, or by
recombination events during propagation of the DNA in bacterial hosts. The deletion at
nucleotides 539-584 in SEQ ID NO:3 present in ATCC Deposit No. 98809 is an obvious
error. I believe that one skilled in the art would have sequenced the deposited material and
recognized the deletion as an error. My belief is based upon the following: First, the
deletion found in ATCC Deposit No. 98809 is not present in SEQ ID NO:3 as originally
filed or as amended. Second, the deletion is not present in the DNA sequence of the closely
related mouse homolog, SEQ ID NO:1. Third, the deletion causes a frame shift in the
reading frame of SEQ ID NO:3 and predicts a truncated protein product compared with that
encoded by SEQ ID NO:3 as originally filed and as amended. Fourth, the amino acids
encoded by the nucleotide sequence downstream of the deletion bear no similarity to the
amino acids encoded by SEQ ID NO:3 or the mouse homolog [...], encoded by SEQ
[...]. Finally, an examination of the sequence reveals two large open reading frames
(ORF) in the sequence in different frames. See Fig. 3 of Exhibit B. The ORFs correspond
to the amino acid residues of DNMT3A upstream and downstream of the deletion. The
presence of two large ORFs in different frames indicates a possible frameshifting sequence
error or deletion. All of these factors indicate that the deletion present in ATCC Deposit No.
98809 is an error, and would be recognized as such by a person of ordinary skill in the art.

10. It is my belief that the combination of ATCC Deposit No. 98809, which
discloses the six nucleotides in the coding region of SEQ ID NO:3 amended on July 23,
2001, in combination with SEQ ID NO:3 as originally filed in U.S. Appl. Nos. 60/090,906
and 60/093,993, which disclose nucleotides 539-584 of amended SEQ ID NO:3, clearly
conveys to someone skilled in the art the entire nucleotide sequence of amended SEQ ID
NO:3.

11. I hereby declare that all statements made herein of my own knowledge are
true and that all statements made on information and belief are believed to be true; and
further that these statements were made with the knowledge that willful false statements and
the like so made are punishable by fine or imprisonment, or both, under Section 1001 of
Title 18 of the United States Code and that such willful false statements may jeopardize the
validity of the present patent application or any patent issued thereon.



Respectfully submitted, 



Date:

CURRICULUM VITAE

PART I: General Information

DATE PREPARED: July 5, 2005 
Name: Kenneth D. Bloch 

Office Address: Cardiovascular Research Center 
Massachusetts General Hospital 
149 13th Street (149-4201)
Charlestown, MA 02129
(617)724-9540

Home Address: 80 Park Street, Apartment 32, Brookline, MA 02446
E-Mail: kdbloch@partners.org FAX: (617) [...]

Place of Birth: New York, NY 
Education: 

[...] Brown University (Biomedicine)
1981 M.D. Brown University

Postdoctoral Training:

Internship and Residencies:

1981-1984 Internal Medicine, Massachusetts General Hospital

Fellowships:

[...] Medicine, Harvard Medical School
1987-1989 [...] Genetics, Harvard Medical School

1989 Clinical and Research Fellow in Medicine, Harvard Medical School

Licensures and Certification: 

1989 Diplomate, American Board of Internal Medicine
Diplomate, Subspecialty Board of Cardiovascular Diseases,
American Board of Internal Medicine

Academic Appointments:

[...] Medicine, Harvard Medical School

1990-1997 Assistant Professor of Medicine, Harvard Medical School
2005- Associate Professor of Medicine, Harvard Medical School

2005 Associate Professor of Anaesthesia, Harvard Medical School

Hospital or Affiliated Institution Appointments:

[...] Medicine, Massachusetts General Hospital

[...] Assistant Physician, Massachusetts General Hospital
1999-2005 Associate Physician, Massachusetts General Hospital
[...]- Physician, Massachusetts General Hospital

Other Professional Positions and Major Visiting Appointments: none 
Hospital and Health Care Organization Service Appointments:

1990- Attending Physician, Cardiology and Medical Services,
Massachusetts General Hospital
1991- Practice Member, Cardiac Unit Associates,
Massachusetts General Hospital



Major Administrative Responsibilities:

1989-1992
1990-[...] [...] Cardiac Unit, Massachusetts General Hospital

[...] Director, Fellowship in Cardiovascular
Disease, Massachusetts General Hospital

[...] Investigator, Cardiovascular Research Center, Massachusetts
[...]

[...] the Cardiovascular Research Center (PI:
Mark C. Fishman), Massachusetts General Hospital

[...] Cardiovascular Research Center
(PI: Mark C. Fishman), Massachusetts General Hospital

[...] Cardiovascular Research Center, Massachusetts
[...]

[...] Investigator, Training Grant to the Cardiovascular Research
Center, Massachusetts General Hospital

Major Committee Assignments: 



1993-1997 
1997-2002 

2002-2004 

2002- 



Hospital:
1991 Member, Selection Committee for the Fellowship Program in
Cardiovascular Disease, Massachusetts General Hospital
1992-1996 Member, Subcommittee on Review of Research Proposals, Committee
on Research, Massachusetts General Hospital
1997-[...] Member, Steering Committee coordinating the integration of the
clinical cardiology fellowship programs at Brigham and Women's
Hospital and Massachusetts General Hospital
Regional:
1991-1994 

2002 



National: 
1992-1995 

1996-1997 
2000 

2000-2002 

2002- 

2002- 

2005 

2005- 
2005- 



Member-At-Large, Executive Committee of the Council on
[...] and Critical Care, American Heart Association, [...]

[...] Council on Cardiopulmonary and
[...] National Center

Chairman, Program Committee, Council on Cardiopulmonary and
Critical Care, American Heart Association, National Center

[...] Committee on Scientific [...] American Heart
[...]



Professional Societies: 



1997- 
2000 

2001 



American Society for Clinical Investigation 



Editorial Boards: 



1989- 



Ad hoc reviewer: 

American Journal of Pathology 

American Journal of Physiology

American Journal of Respiratory Cell and Molecular Biology

Circulation
Circulation Research
Journal of Applied Physiology
Journal of Biological Chemistry
Journal of Clinical Investigation
New England Journal of Medicine
Nature Medicine
Proceedings of the National Academy of Sciences



Awards and Honors: 



2001 
2002 
2003 



1978 Magna cum laude, Brown University

1984-1986 Patricia McCormick Memorial Prize, Brown University

1986-1987 [...] Fellowship, Leukemia Society of America
[...] Postdoctoral Fellowship, Pfizer Pharmaceuticals

[...] Award, American Heart Association, National Center (declined)

2000 [...] Council Cardiovascular Research Competition
Second Prize for the basic science research abstracts
from the Massachusetts Thoracic Society

[...] Council on Basic Cardiovascular Sciences,
American Heart Association

[...] of the [...] Sciences Council, American Heart
[...]

Inaugural Fellow of the Council on Cardiopulmonary, Perioperative
and Critical Care, American Heart Association

Part II: Research, Teaching, and Clinical Contributions
A. Narrative Report 

Dr. Bloch's research has focused on cardiovascular biology and the molecular
[...]

[...] group discovered that NOS3-deficient mice are more susceptible to hypoxia [...]
whom pulmonary vascular remodeling precedes the development of [...]
Dr. Bloch has brought into focus the fact that regulation of NO responsiveness may be as
important as regulation of NO production. Dr. Bloch found that [...]
vascular smooth muscle cells (VSMC) to NO or pro-inflammatory [...]

[...] the enzyme responsible for [...]

in response to NO) desensitizing the cells to NO. Dr. Bloch also showed that [...]
dependent protein kinase, an enzyme responsible for vasodilation [...]

factors and one of which is a co-activator of nuclear hormone receptors.
[...] research group has [...] an important role for NO synthesized by

[...] responsible for the impairment of hypoxic pulmonary vasoconstriction associated with
pulmonary injury associated with endotoxemia and volutrauma.

[...] Dr. Bloch's group has begun a new line of research directed at

[...] the gene encoding bone morphogenetic protein receptor
type 2 (BMPR2) cause primary pulmonary hypertension. They have observed that mice
carrying one copy of a mutant BMPR2 gene have mild pulmonary hypertension
associated with abnormalities of pulmonary vascular structure.

[...] clinical and research [...] of

[...] awarded to the Cardiovascular Research Center. In this role, [...]
From 2002 through 2004, Dr. Bloch served as the Interim Director of the [...]

B. Funding Information (Research): 



Past: 

1991-1992 



Research Grant, Dr. Louis Skarow Memorial Fund (PI: K. Bloch)
"[...] in a model of pulmonary hypertension."

1991-1994 [...] American Heart Association, National Center
(PI: K. Bloch) "Pulmonary expression of endothelin genes."






1991-1996 NHLBI/R29 (PI: K. Bloch)
"Biosynthesis of the endothelin family of vasoactive peptides."

1995-1996 Sponsored Research Agreement through the Cardiovascular
Research Center from Bristol-Myers Squibb (PI: K. Bloch)
"Isolation and characterization of novel vascular nitric oxide
synthases."

1996-1997 NHLBI/R01 (Co-PI: K. Bloch)
"The pulmonary response to inhaled particulates."

1996-2000 NHLBI/R01 (PI: K. Bloch)
"[...] cGMP signal transduction in pulmonary injury."

1996-2001 [...] American Heart Association, National Center
(PI: K. Bloch) "Regulation of a nitric oxide receptor component, the β1
subunit of soluble guanylate cyclase."

1998-2002 NHLBI/T32 (Co-PI: K. Bloch; PI: M.C. Fishman)
"[...] cardiovascular biology."

1998-2003 [...]
"Nitric oxide/cGMP signal transduction in vascular injury."

2001-2002 Pfizer Pharmaceuticals (Co-PI: K. Bloch; PI: M.J. Semigran)
"Evaluation of the effects of sildenafil with and without inhaled nitric
oxide (NO) on platelet-mediated thrombosis and cardiac function in a
canine model of cyclic coronary artery occlusion."



Current:

1996- NHLBI/R01 (Co-PI: K. Bloch; PI: W. Zapol)
"Studies of inhaled nitric oxide."

2002- INO Therapeutics, Inc. (PI: K. Bloch)
"Laboratory-based initiatives for the further development of the
[...] nitric oxide."

"BMPR2 in the pathogenesis of pulmonary hypertension"

C. Report of Current Research Activities (Bench and Clinical Research):

Project 1: Studies of inhaled nitric oxide. (Co-PI: K. Bloch)

Project 2: Evaluation of the systemic effects of breathing nitric oxide (PI: K. Bloch)

Project 3: Role of nitric oxide in left ventricular remodeling (PI: K. Bloch)






D. Report of Teaching

1. Local Contributions

a. medical school

1977 Brown University, Biomed 110, Biophysics
course director: Babette Stewart
Teaching Assistant
50 undergraduates (approx.)
3-5 hours preparation and contact time/week (approx.)
semester course

1978-1981 Brown University, Biomed 6, Introduction to Physiology
course director: Peter Stewart
Teaching Assistant
50 undergraduates (approx.)
3-5 hours preparation and contact time/week (approx.)
semester course

1986 Harvard University, Genetics 700.0, Fundamentals of Genetics
course director: Philip Leder
Teaching Assistant
15 medical students (approx.)
3 hours preparation and contact time/week (approx.)
semester course

1990-1992 [...] School, Introduction to Clinical Medicine
[...] medical students, 3 sessions/week, 2 hours/session, 3 weeks/year
total: 9 hours preparation time, 18 hours contact time

1993-1996 Harvard Medical School, Patient-Doctor II Course
course directors: Katherine Treadway and Diane Fingold
Preceptor
50 medical students (approx.)
3 sessions, 2 hours/session, 1/year
total: 9 hours preparation and contact time

b. graduate medical courses: none
c. local invited teaching presentations

1993-1994 "Ethical Conduct of Research" Course, Massachusetts General
Hospital; Preceptor; 5-10 postdoctoral fellows, 2 sessions/year,
2 hours/session, prep time: 1 hour/session.

1994 Cardiology Grand Rounds, Massachusetts General Hospital; Lecturer;
50 attendees: medical students, residents, clinical and research fellows,
faculty; 4 hrs prep and 1 hr contact time.

1998 Anesthesiology Grand Rounds, Massachusetts General Hospital;
Lecturer; 50 attendees: medical students, residents, clinical and research fellows,
[...]; 4 hrs prep time and 1 hr contact time

1999 [...] Nutrition, Harvard School of Public Health; Lecturer; 50 attendees:
medical students, residents, clinical and research fellows, faculty;
[...] prep and 1 hr contact time.

2001 West Roxbury Veterans Administration Hospital Cardiology Grand
[...] research fellows, faculty; 4 hrs prep and 2 hr contact time

2003 [...] Hospital, Vascular Research Seminar; 30
[...] residents, clinical and research fellows,
faculty; 4 hrs prep and 2 hr contact time

Massachusetts General Hospital, Cardiology Grand Rounds; 50
attendees: nurses, medical students, residents, clinical and research
fellows, faculty; 5 hrs prep and 1 hr contact time.
Massachusetts General Hospital, Critical Care Research Retreat; 50
[...]
d. continuing medical education courses: none
e. advisory and supervisory responsibility

Research Advisor, Cardiology Fellowship Program, Massachusetts 



2004
2005



1989- 
1989-1992 

1990- 
1991-
[...] Coronary Care Unit, Massachusetts General

1992-2002

1992 Hospital (variable hours/year)

Principal Investigator, Cardiovascular Research Center, Massachusetts
General Hospital, scientific supervisor for Research Fellows

2002- [...] including two Assistant Professors of Anesthesia (1,000 hours/year)

2002 Principal Investigator, [...] program [...]

doctoral cardiovascular scientists each year (200 hours/year)

f. teaching leadership role



1995-1996 [...] Research Seminar Series, Massachusetts General
Hospital; Organizer; A weekly series of presentations by staff and
senior fellows designed to highlight research in the Cardiac Unit.
Cardiac Unit Society of Fellows, Massachusetts General Hospital;
[...] series of symposia presented [...]
[...] of the Chief of the MGH Cardiac Unit.

1997 Society of Cardiology Fellows, Massachusetts General Hospital and
Brigham and Women's Hospital; Co-Organizer; A quarterly series of
symposia designed to foster scientific communication and
collaboration between MGH and BWH and to highlight research [...]

CVRC Seminar Series; Organizer; A weekly seminar series presented
by visiting scientists in the MGH Cardiovascular Research Center

g. names of advisees and trainees/current positions



1989-1990 



Charles C. Hong, MD/PhD, Clinical and Research Fellow,
1990-1991 [...] Division, Massachusetts General Hospital

1991-[...] [...], MD, private practice

Akito Shimouchi, MD, Assistant Professor of Medicine, National
1990-1991 Cardiovascular Center Research Institute, Osaka, Japan

[...], MD, PhD, Associate Professor of Medicine,
Cardiac Unit and Laboratory for Molecular and Vascular Biology,

1992-1994 [...] Hospital Gasthuisberg, Leuven, Belgium
1992-1994, 2001- Noriko Kawai, MD, PhD, Research Fellow in Medicine,

1992-[...] Cardiovascular Research Center, Massachusetts General Hospital
1992-1993 John J. Lepore, MD, Instructor of Medicine, University of

Pennsylvania Medical Center, Department of Medicine,
[...] Medicine Division [...] Molecular Cardiology [...]
Johanna Wolfram, MD, University Clinic for Internal Medicine II,

1993-1999 [...] General Hospital, Vienna, Austria
1999 [...] Sanchez, MD, Instructor in Pediatrics, Harvard Medical

School/Massachusetts General Hospital



1993-1994
1996-1998
1996-1998

1998-2002



1993-2004 Jesse p. Roberts, MD, Associate Professor of Anesthesiology in

1994-1995 [...] Medical School/Massachusetts General Hospital
[...]-1995 Jeffrey Thomas, MD, Associate Professor of Neurosurgery, Chief of

Neurovascular and Neurointerventional Surgery, Division of
Neurosurgery, The University of New Mexico Health Sciences
Center

1994-1996 Alexandra Holzmann, MD, Department of Anesthesiology,

University of Heidelberg

[...] Liu, MD, on leave to care for her children

1995-1998 Jean-Daniel Chiche, MD, Professor, Cochin University, Paris,

France

[...] Honkanen, MD, private practice

1996-1998 Masao Takata, MD/PhD, Assistant Professor, Department of
Anaesthetics and Intensive Care, Imperial College School of
Medicine, Hammersmith Hospital, London, United Kingdom
Douglas Wirtnlin, MD, Assistant Professor of Surgery, Vascular
Surgery, University of Alabama at Birmingham
Joerg Weimann, MD, Professor of Anesthesia, Department of
Anesthesiology and Intensive Care Medicine, Charite-Berlin
Medical School, Campus Benjamin Franklin
Galina Filippov, MD, Research Scientist, Omnigene Bioproducts

1998, 2000-2001 Zena Quezado, MD, Chief, Department of Anesthesia and Surgical
Services, Warren G. Magnuson Clinical Center, National Institutes
of Health

1998-2000 Roman Ullrich, MD, Associate Professor of Anesthesia and

Intensive Care Medicine, Vienna General Hospital, Medical
University of Vienna

Hiroshi Nakajima, MD, Neurosurgery Residency, Tokyo Women's
Medical College, Tokyo, Japan

Pini Orbach, PhD, Project Manager, Drug Development, Perdix
Pharmaceuticals, Inc.

Fumito Ichinose, MD, PhD, Assistant Professor of Anesthesia,
[...] Medical School/Massachusetts General Hospital

2001-2003 Arniee Limbach, PhD, Post-doctoral Fellow, Center for Human

[...] Genetics, Munroe-Meyer Institute, University of Nebraska
[...]

2002-2003 Elisabeth Choe, MD, Resident in Internal Medicine, University of

Texas Southwestern

2002-2004 Cornelius Busch, MD, Resident in Anesthesia, Department of

Anesthesiology, University of Heidelberg

[...], MD, Resident in Anesthesiology, Lille, France

Hideyuki Beppu, MD, PhD, Instructor in Medicine, Cardiovascular
Research Center, Massachusetts General Hospital
Manu Buys, PhD, Research Fellow in Medicine, Cardiovascular
Research Center, Massachusetts General Hospital



1999-2001 
1999-2001 
1999- 



2003- 
2003-
2003 [...] Yu, MD, PhD, Clinical and Research Fellow in Medicine,
2004-2005 Cardiovascular Research Center, Massachusetts General Hospital

2004-2005 David Bayne, undergraduate student, Harvard University
Ryuji Hataishi, MD, PhD, Research Fellow in Anesthesia,
Massachusetts General Hospital

[...] Hospital [...] Internal [...]
[...], Clinical Fellow in Medicine,

[...] Research Fellow in [...]



2004- 
2004- 

2005- 



2. Regional, National, or International Contributions



1994 
1994 
1994 
1995 

1996 
1996 

1997 

1997 
1998 

1998 

1998 

1999 

1999 

1999 
1999 

1999 

1999 

1999 

2004 



Invited Lecturer, American College of Cardiology, Dallas, TX
Invited Lecturer, Pfizer Pharmaceuticals, Groton, CT
Invited Lecturer, University of Leuven, Belgium

Invited Lecturer, Boston Heart Foundation, Boston, MA
Invited Lecturer, Georgia Medical College, Vascular Biology
Division, Atlanta, GA

Invited Lecturer, Harvard Medical School, Vascular Biology
Seminar, Boston, MA
Invited Lecturer, Boston University, Whittaker Foundation, Boston, MA

Invited Lecturer, Brigham and Women's Hospital, Cardiology
Division, Monday Morning Research Conference, Boston, MA

Invited Lecturer, New York Medical College, Department of
Pharmacology Seminar, New York, NY

Invited Lecturer, Tufts University School of Medicine, Dept. of
Medicine, Boston, MA
Invited Lecturer, Tufts University School of Medicine/New England
Medical Center, Pulmonary and Critical Care Division, Boston, MA
Invited Lecturer, Millenium Pharmaceuticals, Inc., Cambridge, MA
[...] of Washington, Cardiology Dept.

Invited Lecturer, University of Alabama at Birmingham, Dept. of
Pathology, Birmingham, AL

Invited Lecturer, 3rd International Society for Medical Gases
Meeting, Heidelberg, Germany

Invited Lecturer, National Institute of Health, Critical Care
Medicine, Bethesda, MD

[...] INO Therapeutics Inc., Scientific Advisory Board,
2002

2002 

2003 

2003 

2003 

2004 

2004



Invited Lecturer, Medical College of Wisconsin, [...]

[...] American Heart Association Scientific Sessions

[...] Boston

Invited Lecturer, American Heart Association, Northeast Affiliate,
Symposium: Launching a Career in [...]

[...]

[...] Cambridge, MA: "Nitric oxide/cGMP signal transduction:
implications for cardiovascular gene transfer."
Invited Lecturer, Critical Therapeutics, Inc., Lexington, MA:
"Mechanisms of pulmonary vascular dysfunction [...]:
insights gained from genetically-modified mice."

E. Report of Clinical Activities

[...] a practice within the Cardiac Unit
[...] Massachusetts General Hospital. His practice
consists of patients [...] of a moderate to high level of complexity
referred to [...]



2004 



2004

Exhibit B

FIGURE 1

Alignment of human DNMT3A from ATCC Deposit No. 98809 (top) and currently
amended SEQ ID NO:3 (bottom) 1

1 1 1 1 1 1 1 1 1 1 Ml 1 1 1 1 III 1 1 1 1 1 II II II 1 1 II 1 1 1 1 II 1 1 II III 1 1 1 1 1 1 1 II 1 1|

gagccctgctggggggcagaagggcggggccccagcagagggagagggtgcagctgagac 600 
cctgcctgaagcctcaagagcagtggaaaatggctgctgcacccccaaggagggccgagg 

I M II I M II II 1 1 M II I II 1 1 II II II I II I II 1 1 II II 1 1 1 II II 1 1 1 1 1 II 1 1 1 II 

cctgcctgaagcctcaagagcagtggaaaatggctgctgcacccccaaggagggccgagg 6 60 
agcccctgcagaagcgggcaaagaacagaaggagaccaacatcgaatccatgaaaatgga 

MMIIMMMMIMMMMMMMMMMIMMMIMMIIIIII MIMM 

agcccctgcagaagcgggcaaagaacagaaggagaccaacatcgaatccatgaaaatgga 720 



1 Bolded nucleotides indicate nucleotides that were amended on July 23, 2001.
gggctcccggggccggctgcggggtggcttgggctgggagtccagcctccgtcagcggcc

MM I Mill I M 1 1 III MINIM I III III Mill 1 1 IMI III 1 1 Ml III Mill 

gggctcccggggccggctgcggggtggcttgggctgggagtccagcctccgtcagcggcc 7 80 
catgccgaggctcaccttccaggcgggggacccctactacatcagcaagcgcaagcggga 

Ml M II II MM IMIIIIIIIMIIII MIIIIIMI Mill Ml I llllllll MM 

catgccgaggctcaccttccaggcgggggacccctactacatcagcaagcgcaagcggga 84 0 
cgagtggctggcacgctggaaaagggaggctgagaagaaagccaaggtcattgcaggaat 

1 1 1 f 1 1 f 1 1 1 1 1 f 1 1 1 1 1 1 1 1 1 1 f I M I M II 1 1 1 1 1 1 1 1 1 1 1 1 1 II I M I 

cgagtggctggcacgctggaaaagggaggctgagaagaaagccaaggtcattgcaggaat 900 
gaatgctgtggaagaaaaccaggggcccggggagtctcagaaggtggaggaggccagccc 

f 1 1 1 1 1 f 1 1 1 1 1 1 1 1 1 1 1 1 r 1 1 1 1 1 1 1 ii 1 1 1 1 1 1 1 1 1 1 1 1 m 1 1 1 1 1 1 1 1 1 1 1 1 m 1 1 1 

gaatgctgtggaagaaaaccaggggcccggggagtctcagaaggtggaggaggccagccc 960 
tcctgctgtgcagcagcccactgaccccgcatcccccactgtggctaccacgcctgagcc 

1 1 1 1 1 r 1 1 1 3 1 1 r t e 1 1 1 1 1 1 1 1 f 1 1 r 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 j 1 1 1 j i ; 1 1 r j 1 1 1 1 

tcctgctgtgcagcagcccactgaccccgcatcccccactgtggctaccacgcctgagcc 102 0 
cgtggggtccgatgctggggacaagaatgccaccaaagcaggcgatgacgagccagagta 

M Ml II lllllllll llllllll IMMMI IMIMMII II Mill IMIIMIIM 

cgtggggtccgatgctggggacaagaatgccaccaaagcaggcgatgacgagccagagta 10 8 0 
cgaggacggccggggctttggcattggggagctggtgtgggggaaactgcggggcttctc 

II I I.I 1 1 1 1 1 1 1 Ml II 1 1 II 1 1 1 III 1 1 1 1 MM III MINI II 1 1 1 1 II III I II II 

cgaggacggccggggctttggcattggggagctggtgtgggggaaactgcggggcttctc 114 0 
ctggtggccaggccgcattgtgtcttggtggatgacgggccggagccgagcagctgaagg 

1 1 I I I I M I M M 1 I I II I I I I 1 1 I I I I I IE II I I I I I M I f M 1 1 I I I M I 1 1 

ctggtggccaggccgcattgtgtcttggtggatgacgggccggagccgagcagctgaagg 12 0 0 
cacccgctgggtcatgtggttcggagacggcaaattctcagtggtgtgtgttgagaagct 

1 1 1 1 M II 1 1 II II I II 1 1 1 II 1 1 IMI II 1 1 1 II 1 1 1 1 1 Ml II 1 1 1 1 1 Ml I II 1 1 II 

cacccgctgggtcatgtggttcggagacggcaaattctcagtggtgtgtgttgagaagct 12 6 0 
gatgccgctgagctcgttttgcagtgcgttccaccaggccacgtacaacaagcagcccat 

II IIMIII I II llllll I llllllll II llllll MIMIII 1 1 IIMMI llllllll 

gatgccgctgagctcgttttgcagtgcgttccaccaggccacgtacaacaagcagcccat 132 0 
gtaccgcaaagccatctacgaggtcctgcaggtggccagcagccgcgcggggaagctgtt 

MM II Mill MM MIMIII Mill Ml MIIIIIIIIIIIIIIM lllllll MM 

gtaccgcaaagccatctacgaggtcctgcaggtggccagcagccgcgcggggaagctgtt 13 80 
cccggtgtgccacgacagcgatgagagtgacactgccaaggccgtggaggtgcagaacaa 

II 1 1 1 1 1 1 1 1 llllllll 1 1 lllllll llllllll MM I lllllll I III I lllllll I 

cccggtgtgccacgacagcgatgagagtgacactgccaaggccgtggaggtgcagaacaa 144 0 
gcccatgattgaatgggccctggggggcttccagccttctggccctaagggcctggagcc 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

gcccatgattgaatgggccctggggggcttccagccttctggccctaagggcctggagcc 1500
accagaagaagagaagaatccctacaaagaagtgtacacggacatgtgggtggaacctaa

IIIIMIIIIIIIIIIIIIIIIIIIIIIMIIIMIIIIII Illlllllllll 

accagaagaagagaagaatccctacaaagaagtgtacacggacatgtgggtggaacctga 1560 
ggcagctgcctacgcaccacctccaccagccaaaaagccccggaagagcacagcggagaa

iiiiiiiiiiii iiiii 111 i ii 1 1 ii inn i ii iiiiii 1 1 

ggcagctgcctacgcaccacctccaccagccaaaaagccccggaagagcacagcggagaa 162 0 
gcccaaggtcaaggagattattgatgagcgcacaagagagcggctggtgtacgaggtgcg

I M II 1 1 MIMIIIMII IMIIIIIIIMMIIIIIIIIIIII 

gcccaaggtcaaggagattattgatgagcgcacaagagagcggctggtgtacgaggtgcg 16 8 0 
gcagaagtgccggaacattgaggacatctgcatctcctgtgggagcctcaatgttaccct 

1 1 1 M M 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Ill Mill 

gcagaagtgccggaacattgaggacatctgcatctcctgtgggagcctcaatgttaccct 1740
ggaacaccccctcttcgttggaggaatgtgccaaaactgcaagaactgctttctggagtg 

MM I II I IIIIII II MINI M IIIIII II I MM I M Mil I Ml I Ml III 1 1 II I 

ggaacaccccctcttcgttggaggaatgtgccaaaactgcaagaactgctttctggagtg 1800
tgcgtaccagtacgacgacgacggctaccagtcctactgcaccatctgctgtgggggccg 

MM I IlillllMI MUNI 1 1111111111111111 1 11111111111111111111 

tgcgtaccagtacgacgacgacggctaccagtcctactgcaccatctgctgtgggggccg 1860
tgaggtgctcatgtgcggaaacaacaactgctgcaggtgcttttgcgtggagtgtgtqqa 

IIIIMIIIIIIIIIIIIIIII Mill IIIII I Mill IIIIII I 

tgaggtgctcatgtgcggaaacaacaactgctgcaggtgcttttgcgtggagtgtgtgga 1920
cctcttggtggggccgggggctgcccaggcagccattaaggaagacccctggaactgcta 

Ml MMIMMMMMMMMMMMMMMMMMMMMIIIIIMIIMI 

cctcttggtggggccgggggctgcccaggcagccattaaggaagacccctggaactgcta 1980
catgtgcgggcacaagggtacctacgggctgctgcggcggcgagaggactggccctcccg 

1 1 1 1 1 1 1 1 1 1 1 1 1 f 1 1 1 f 1 1 1 1 1 f 1 1 1 1 1 1 1 1 1 1 1 1 1 1 f 1 1 1 1 1 f 1 1 1 j 1 1 1 1 1 1 1 1 1 1 1 

catgtgcgggcacaagggtacctacgggctgctgcggcggcgagaggactggccctcccg 2040
gctccagatgttcttcgctaataaccacgaccaggaatttgaccctccaaaggtttaccc 

M I II llllllll Mill II IIMIMI IMIIMIII I Mill II IMIMIII Mill 

gctccagatgttcttcgctaataaccacgaccaggaatttgaccctccaaaggtttaccc 2100
acctgtcccagctgagaagaggaagcccatccgggtgctgtctctctttgatggaatcgc 

1 1 1 1 1 1 1 1 1 1 1 1 1 1 i 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 

acctgtcccagctgagaagaggaagcccatccgggtgctgtctctctttgatggaatcgc 2160
tacagggctcctggtgctgaaggacttgggcattcaggtggaccgctacattgcctcgga 

M 1 1 1 1 1 1 1 1 1 II III 1 1 Mill INI 1 1 III 1 1 1 II 1 1 1 M II II 1 1 1 1 II 1 1 1 1 II 1 1 

tacagggctcctggtgctgaaggacttgggcattcaggtggaccgctacattgcctcgga 2220
ggtgtgtgaggactccatcacggtgggcatggtgcggcaccaggggaagatcatgtacgt 

miiimiimimmmMiiiiiiimiiMiimmiM i 

ggtgtgtgaggactccatcacggtgggcatggtgcggcaccaggggaagatcatgtacgt 2280
cggggacgtccgcagcgtcacacagaagcatatccaggagtggggcccattcgatctggt 

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 > 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ! 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 

cggggacgtccgcagcgtcacacagaagcatatccaggagtggggcccattcgatctggt 2340
gattgggggcagtccctgcaatgacctctccatcgtcaaccctgctcgcaagggcctcta 

I M 1 1 1 1 1 J 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 f 1 1 1 1 1 1 1 1 M 1 1 1 1 f I i 1 1 1 1 1 1 1 M 1 1 1 1 1 1 1 1 1 

gattgggggcagtccctgcaatgacctctccatcgtcaaccctgctcgcaagggcctcta 2400
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cgagggcactggccggctcttctttgagttctaccgcctcctgcatgatgcgcggcccaa 

Mill IIIIIMM IMIIIIIIIIIIIIIIIIIIIIIMIIIMIIIIIIMIIIMII 

cgagggcactggccggctcttctttgagttctaccgcctcctgcatgatgcgcggcccaa 2460 
ggagggagatgatcgccccttcttctggctctttgagaatgtggtggccatgggcgttag 

Mill IIIIIIIIMMIIMIIIIMIIIIIIIIIIIIIIIIIIII lllllll I Mill 

ggagggagatgatcgccccttcttctggctctttgagaatgtggtggccatgggcgttag 2520
tgacaagagggacatctcgcgatttctcgagtccaaccctgtgatgattgatgccaaaga 

III 1 1 1 1 1 1 1 II I II 1 1 II II 1 1 II II M 1 1 1 II I M 1 1 1 1 1 II 1 1! 1 1 1 1 1 !! 1 1 1 1 1 1 

tgacaagagggacatctcgcgatttctcgagtccaaccctgtgatgattgatgccaaaga 2580
agtgtcagctgcacacagggcccgctacttctggggtaaccttcccggtatgaacaggcc 

1 1 1 1 M 1 1 1 1 1 1 1 1 i 1 1 1 1 1 1 1 J 1 1 1 1 i 1 1 1 1 1 i J I f 1 1 1 1 1 1 1 1 1 ! 1 1 1 1 1 1 1 1 1 1 1 1 1 

agtgtcagctgcacacagggcccgctacttctggggtaaccttcccggtatgaacaggcc 2640
gttggcatccactgtgaatgataagctggagctgcaggagtgtctggagcatggcaggat 

II 1 1 1 1 1 1 1 1 1 1 1 1 1 M 1 1 1 1 1 1 1 1 1 1 1 1 1 1 II 1 1 1 1 1 1 1 1 1 1 1 1 1 1 i 1 1 1 1 1 1 1 1 1 1 1 1 

gttggcatccactgtgaatgataagctggagctgcaggagtgtctggagcatggcaggat 2700
agccaagttcagcaaagtgaggaccattactacgaggtcaaactccataaagcagggcaa 

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 M ti 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 

agccaagttcagcaaagtgaggaccattactacgaggtcaaactccataaagcagggcaa 2760 
agaccagcattttcctgtcttcatgaatgagaaagaggacatcttatggtgcactgaaat 

1 1 1 1 1 1 1 1 1 1 1 1 1 1 II 1 1 1 1 1 M 1 1 1 1 M I M I II II I M 1 1 1 1 1 1 1 M 1 1 M 1 1 1 1 1 1 I 

agaccagcattttcctgtcttcatgaatgagaaagaggacatcttatggtgcactgaaat 2820
ggaaagggtatttggtttcccagtccactatactgacgtctccaacatgagccgcttggc 

Ml II II Mill lllllll 1 1 lllllll II I II II 1 1 lllllll IIIIIMM I MM II 

ggaaagggtatttggtttcccagtccactatactgacgtctccaacatgagccgcttggc 2880
gaggcagagactgctgggccggtcatggagcgtgccagtcatccgccacctcttcgctcc 

II II II 1 1 M 1 1 1 1 II II I II II I II I II 1 1 II II 1 1 1 II II II I II 1 1 1 II II 1 1 1 1 II 

gaggcagagactgctgggccggtcatggagcgtgccagtcatccgccacctcttcgctcc 2940
gctgaaggagtattttgcgtgtgtgtaagggacatgggggcaaactgaggtagcg 

Ml II II I Ml 1 1 IIMM 1 1 MUM I lllllll 1 1 Mill II III Mill M I 

gctgaaggagtattttgcgtgtgtgtaagggacatgggggcaaactgaggtagcg 2995
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FIGURE 2 

Alignment of predicted amino acids encoded by DNMT3A cDNA in ATCC Deposit
No. 98809 (top) and predicted amino acids encoded by currently amended SEQ ID
NO:3 (bottom) 2



MPAMPSSGPGDTSSSAAEREEDRKDGEEQEEPRGKEERQEPSTTARKVGRPGRKR 55 

MPAMPSSGPGDTSSSAAEREEDRKDGEEQEEPRGKEERQEPSTTARKVGRPGRKR 55 

KHPPVESGDTPKDPAVISKSPSMAQDSGASELLPNGDLEKRSEPQPEERVQLRPC 110 

KHPPVESGDTPKDPAVISKSPSMAQDSGASELLPNGDLEKRSEPQPEEGSPAGGQ 110

LKPQEQWKMAAAPPRRAEEPLQKRAKNRRRPTSNP*KWRAPGAGCGVAWAGSPAS 165 

KGGAPAEGEGAAETLPEASRAVENGCCTPKEGRGAPAEAGKEQKETNIESMKMEG 165

VSGPCRGSPSRRGTPTTSASASGTSGWHAGKGRLRRKPRSLQE*MLWKKTRGPGS 220

SRGRLRGGLGWESSLRQRPMPRLTFQAGDPYYISKRKRDEWLARWKREAEKKAKV 220

LRRWRRPALLLCSSPLTPHPPLWLPRLSPWGPMLGTRMPPKQAMTSQSTRTAGAL 275

GMNAVEENQGPGESQKVEEASPPAVQQPTDPASPTVATTPEPVGSDAGDKNIAAT 275

ALGSWCGGNCGASPGGQAALCLGG*RAGAEQLKAPAGSCGSETANSQWCVLRS*C 330

KAGDDEPEYEDGRGFGIGELVWGKLRGFSWWPGRIVSWWMTGRSRAAEGTRWVMW 330

R*ARFAVRSTRPRTTSSPCTAKPSTRSCRWPAAARGSCSRCATTAMRVTLPRPWR 385

FGDGKFSVVCVEKLMPLSSFCSAFHQATYNKQPMYRKAIYEVLQVASSRAGKLFP 385

CRTSP*LNGPWGASSLLALRAWSHQKKRRIPTKKCTRTCGWNLRQLPTHHLHQPK 440

VCHDSDESDTAKAVEVQNKPMIEWALGGFQPSGPKGLEPPEEEKNPYKEVYTDMW 440

SPGRAQRRSPRSRRLLMSAQESGWCTRCGRSAGTLRTSASPVGASMLPWNTPSSL 495

VEPEAAAYAPPPPAKKPRKSTAEKPKVKEIIDERTRERLVYEVRQKCRNIEDICI 495

EECAKTARTAFWSVRTSTTTTATSPTAPSAVGAVRCSCAETTTAAGAFAWSVWTS 550 

SCGSLNVTLEHPLFVGGMCQNCKNCFLECAYQYDDDGYQSYCTICCGGREVLMCG 550

WWGRGLPRQPLRKTPGTATCAGTRVPTGCCGGERTGPPGSRCSSLITTTRNLTLQ 605

NNNCCRCFCVECVDLLVGPGAAQAAIKEDPWNCYMCGHKGTYGLLRRREDWPSRL 605

RFTHLSQLRRGSPSGCCLSLMESLQGSWC*RTWAFRWTATLPRRCVRTPSRWAWC 660

QMFFANNHDQEFDPPKVYPPVPAEKRKPIRVLSLFDGIATGLLVLKDLGIQVDRY 660 

GTRGRSCTSGTSAASHRSISRSGAHSIW*LGAVPAMTSPSSTLLARASTRALAGS 715 

IASEVCEDSITVGMVRHQGKIMYVGDVRSVTQKHIQEWGPFDLVIGGSPCNDLSI 715 

SLSSTASCMMRGPRREMIAPSSGSLRMWWPWALVTRGTSRDFSSPTL**LMPKKC 770

VNPARKGLYEGTGRLFFEFYRLLHDARPKEGDDRPFFWLFENVVAMGVSDKRDIS 770

QLHTGPATSGVTFPV*TGRWHPL*MISWSCRSVWSMAG*PSSAK*GPLLRGQTP* 825 

RFLESNPVMIDAKEVSAAHRARYFWGNLPGMNRPLASTVNDKLELQECLEHGRIA 825 



2 * indicates a predicted stop codon. Bolded amino acids are encoded by nucleotides located downstream 
of the deletion. 



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SRAKTSIFLSS*MRKRTSYGALKWKGYLVSQSTILTSPT*AAWRGRDCWAGHGAC 880 

KFSKVRTITTRSNSIKQGKDQHFPVFMNEKEDILWCTEMERVFGFPVHYTDVSNM 880

QSSATSSLR*RSILRVCKGHGGKLR**AAWRG 912

SRLARQRLLGRSWSVPVIRHLFAPLKEYFACV 912 
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FIGURE 3 



Human DNMT3A from ATCC Deposit No. 98809 with forward reading frames 
(DNMT3A amino acid residues bolded) 
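
The three forward reading frames listed below can be reproduced from the DNA rows with a short script. The following Python sketch is illustrative only and is not part of the original exhibit: it uses the standard genetic code, and the names `translate_frame` and `three_frames` are hypothetical. Frame numbers here are counted from the first base of the string passed in, and each row is translated in isolation, so a codon that spans two rows of the figure is dropped at the end of the row.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_frame(dna: str, frame: int) -> str:
    """Translate one forward frame (1, 2 or 3); '*' marks a stop codon."""
    seq = dna.upper()
    start = frame - 1
    codons = (seq[i:i + 3] for i in range(start, len(seq) - 2, 3))
    # Codons containing anything other than A, C, G, T are shown as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frames(dna: str) -> dict:
    """Return the three forward translations keyed by '+1', '+2', '+3'."""
    return {f"+{frame}": translate_frame(dna, frame) for frame in (1, 2, 3)}


# First DNA row of the figure.
first_row = "GCCGCGGCACCAGGGCGCGCAGCCGGGCCGGCCCGACCCCACCGGCCATAC"
for label, protein in three_frames(first_row).items():
    print(label, protein)
```

For the first row, the three translations agree with the +1, +2 and +3 lines of the figure, apart from the last residue of the +2 and +3 lines, which is encoded by a codon that continues on the next row.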

DNA: GCCGCGGCACCAGGGCGCGCAGCCGGGCCGGCCCGACCCCACCGGCCATAC 51
+3: RGTRARSRAGPTPPAIR
+2: PRHQGAQPGRPDPTGHT
+1: AAAPGRAAGPARPHRPY

DNA: GGTGGAGCCATCGAAGCCCCCACCCACAGGCTGACAGAGGCACCGTTCACC 102
+3: WSHRSPHPQADRGTVHQ
+2: VEPSKPPPTG*QRHRSP
+1: GGAIEAPTHRLTEAPFT

DNA: AGAGGGCTCAACACCGGGATCTATGTTTAAGTTTTAACTCTCGCCTCCAAA 153
+3: RAQHRDLCLSFNSRLQR
+2: EGSTPGSMFKF*LSPPK
+1: RGLNTGIYV*VLTLASK

DNA: GACCACGATAATTCCTTCCCCAAAGCCCAGCAGCCCCCCAGCCCCGCGCAG 204
+3: PR*FLPQSPAAPQPRAA
+2: TTIIPSPKPSSPPAPRS
+1: DHDNSFPKAQQPPSPAQ

DNA: CCCCAGCCTGCCTCCCGGCGCCCAGATGCCCGCCATGCCCTCCAGCGGCCC 255
+3: PACLPAPRCPPCPPAAP
+2: PSLPPGAQMPAMPSSGP
+1: PQPASRRPDARHALQRP

DNA: CGGGGACACCAGCAGCTCTGCTGCGGAGCGGGAGGAGGACCGAAAGGACGG 306
+3: GTPAALLRSGRRTERTE
+2: GDTSSSAAEREEDRKDG
+1: RGHQQLCCGAGGGPKGR

DNA: AGAGGAGCAGGAGGAGCCGCGTGGCAAGGAGGAGCGCCAAGAGCCCAGCAC 357
+3: RSRRSRVARRSAKSPAP
+2: EEQEEPRGKEERQEPST
+1: RGAGGAAWQGGAPRAQH

DNA: CACGGCACGGAAGGTGGGGCGGCCTGGGAGGAAGCGCAAGCACCCCCCGGT 408
+3: RHGRWGGLGGSASTPRW
+2: TARKVGRPGRKRKHPPV
+1: HGTEGGAAWEEAQAPPG

DNA: GGAAAGCGGTGACACGCCAAAGGACCCTGCGGTGATCTCCAAGTCCCCATC 459
+3: KAVTRQRTLR*SPSPHP
+2: ESGDTPKDPAVISKSPS
+1: GKR*HAKGPCGDLQVPI

DNA: CATGGCCCAGGACTCAGGCGCCTCAGAGCTATTACCCAATGGGGACTTGGA 510
+3: WPRTQAPQSYYPMGTWR
+2: MAQDSGASELLPNGDLE
+1: HGPGLRRLRAITQWGLG

DNA: GAAGCGGAGTGAGCCCCAGCCAGAGGAG 561
+3: S G V S P S Q R R
+2: K R S E P Q P E E DELETION
+1: E A E * A P A R G E
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AGGGTGCAGCTGAGACCCTGCCTGAAGC 

DNA: G c s * D P A * S 

+3: p V Q Xj R P C L K P 

+2: q A A B T L P B A 

+1: 



III S RAVSHGCCTPKEGK 



♦ 3. P C R S 0 Q R T B G D « ^ p , R 

DMA : AATGQAGGGCTCCCGOTGCCGGCTGCGG^GTGGCTTGGGCTGGGAGTCCAG 
+3 ! . N G G L P G P A A G Q s p A 

.." . R „ A . p ,° » a i » • • * « - * 8 8 



M E G S R <3 R 



.GGCTCACCTTCCAGGCGGGGGACCCCTA 

5T pTHTTa e 'a h , p o r G q O ^ J t 

t 2 : t Vo fl »V« c pV*% 



DNA: CTACATCAGCAAGCGCAAGCGGGACGAGTGGCTGGCACGCTGGAAAAGGGA 
+ 3 : L H Q Q A Q A G R V A Q R G R 



+1: y I S K R K R 



.TTGCAGGAATGAATGCTGTGGAAGAAAA 



DNA: GGCTGAGAAGAAAGCCAAGGTCATTGCAGU^x^^^^-- r r 

+ 3: G * E E s ^ _ j. q jj * M L H K K T 

-I A L R K A ^ G M N A V E EN 

DNA : CCAGGGGCCCGGGGAGTCTCAGAAGGTGGAGGAGGCCAGCCCTCCTGCTGT 

t3 , P G A R G V S E G a L L L c 

DNA 

+3 . a ^ « _ p H p p L 



^ : GCAGCAGCCCACTGACCCCGCATCCCCCACTGTGGCTACCACGCCTGAGCC 

.., A A A B * P _ p l W L P R I- S P 

S*. « S P P P" P' A" P P T V A T T P E P 

.CAAGAATGCCACCAAAGCAGGCGATGACGA 



612 



663 



714 



765 



DNA: CGTGGGGTCCGATGCTGGGGACAAGAAT^^ g R R * R 

+ 3: R G V R C W G Q B C H U r ft M T , 

+ G . P „\ l fl °» K Va 



DNA: GCCAGAGTACGAGGACOO^^ W ^" — - Q y G 

+ 3 : A R V R G R P G L W H W g ^ q q q 

+2 = «- S . T . R n T fl A R G * % X G B X. V W G 



VCGAGGACGGCCGGGGCTTTGGCATTGGGGAGCTGGTGTGGGG 

R G R P 
r R T A G 

+ 3: E T A G h L L V A R * c l G Q * 

«- t c .°. A /.*- 0 - » • » * - 8 » " " 



816 



867 



918 



969 



1020 



1071 



1122 



1173 



DNA: GACGGGCCGGAGCCGAGCAGCTGAAGGCACCCGCTGGGTCATGTGGTTCGG 1224
+3: DGPEPSS*RHPLGHVVR
+2: RAGAEQLKAPAGSCGSE
+1: TGRSRAAEGTRWVMWFG

DNA: AGACGGCAAATTCTCAGTGGTGTGTGTTGAGAAGCTGATGCCGCTGAGCTC 1275
+3: RRQILSGVC*EADAAEL
+2: TANSQWCVLRS*CR*AR
+1: DGKFSVVCVEKLMPLSS

DNA: GTTTTGCAGTGCGTTCCACCAGGCCACGTACAACAAGCAGCCCATGTACCG 1326
+3: VLQCVPPGHVQQAAHVP
+2: FAVRSTRPRTTSSPCTA
+1: FCSAFHQATYNKQPMYR

DNA: CAAAGCCATCTACGAGGTCCTGCAGGTGGCCAGCAGCCGCGCGGGGAAGCT 1377
+3: QSHLRGPAGGQQPRGEA
+2: KPSTRSCRWPAAARGSC
+1: KAIYEVLQVASSRAGKL

DNA: GTTCCCGGTGTGCCACGACAGCGATGAGAGTGACACTGCCAAGGCCGTGGA 1428
+3: VPGVPRQR*E*HCQGRG
+2: SRCATTAMRVTLPRPWR
+1: FPVCHDSDESDTAKAVE

DNA: GGTGCAGAACAAGCCCATGATTGAATGGGCCCTGGGGGGCTTCCAGCCTTC 1479
+3: GAEQAHD*MGPGGLPAF
+2: CRTSP*LNGPWGASSLL
+1: VQNKPMIEWALGGFQPS

DNA: TGGCCCTAAGGGCCTGGAGCCACCAGAAGAAGAGAAGAATCCCTACAAAGA 1530
+3: WP*GPGATRRREESLQR
+2: ALRAWSHQKKRRIPTKK
+1: GPKGLEPPEEEKNPYKE

DNA: AGTGTACACGGACATGTGGGTGGAACCTGAGGCAGCTGCCTACGCACCACC 1581
+3: SVHGHVGGT*GSCLRTT
+2: CTRTCGWNLRQLPTHHL
+1: VYTDMWVEPEAAAYAPP

DNA: TCCACCAGCCAAAAAGCCCCGGAAGAGCACAGCGGAGAAGCCCAAGGTCAA 1632
+3: STSQKAPEEHSGEAQGQ
+2: HQPKSPGRAQRRSPRSR
+1: PPAKKPRKSTAEKPKVK

DNA: GGAGATTATTGATGAGCGCACAAGAGAGCGGCTGGTGTACGAGGTGCGGCA 1683
+3: GDY**AHKRAAGVRGAA
+2: RLLMSAQESGWCTRCGR
+1: EIIDERTRERLVYEVRQ

DNA: GAAGTGCCGGAACATTGAGGACATCTGCATCTCCTGTGGGAGCCTCAATGT 1734
+3: EVPEH*GHLHLLWEPQC
+2: SAGTLRTSASPVGASML
+1: KCRNIEDICISCGSLNV

DNA: TACCCTGGAACACCCCCTCTTCGTTGGAGGAATGTGCCAAAACTGCAAGAA 1785
+3: YPGTPPLRWRNVPKLQE
+2: PWNTPSSLEECAKTART
+1: TLEHPLFVGGMCQNCKN
DNA: CTGCTTTCTGGAGTGTGCGTACCAGTACGACGACGACGGCTACCAGTCCTA 1836
+3: LLSGVCVPVRRRRLPVL
+2: AFWSVRTSTTTTATSPT
+1: CFLECAYQYDDDGYQSY

DNA: CTGCACCATCTGCTGTGGGGGCCGTGAGGTGCTCATGTGCGGAAACAACAA 1887
+3: LHHLLWGP*GAHVRKQQ
+2: APSAVGAVRCSCAETTT
+1: CTICCGGREVLMCGNNN

DNA: CTGCTGCAGGTGCTTTTGCGTGGAGTGTGTGGACCTCTTGGTGGGGCCGGG 1938
+3: LLQVLLRGVCGPLGGAG
+2: AAGAFAWSVWTSWWGRG
+1: CCRCFCVECVDLLVGPG

DNA: GGCTGCCCAGGCAGCCATTAAGGAAGACCCCTGGAACTGCTACATGTGCGG 1989
+3: GCPGSH*GRPLELLHVR
+2: LPRQPLRKTPGTATCAG
+1: AAQAAIKEDPWNCYMCG

DNA: GCACAAGGGTACCTACGGGCTGCTGCGGCGGCGAGAGGACTGGCCCTCCCG 2040
+3: AQGYLRAAAAARGLALP
+2: TRVPTGCCGGERTGPPG
+1: HKGTYGLLRRREDWPSR

DNA: GCTCCAGATGTTCTTCGCTAATAACCACGACCAGGAATTTGACCCTCCAAA 2091
+3: APDVLR**PRPGI*PSK
+2: SRCSSLITTTRNLTLQR
+1: LQMFFANNHDQEFDPPK

DNA: GGTTTACCCACCTGTCCCAGCTGAGAAGAGGAAGCCCATCCGGGTGCTGTC 2142
+3: GLPTCPS*EEEAHPGAV
+2: FTHLSQLRRGSPSGCCL
+1: VYPPVPAEKRKPIRVLS

DNA: TCTCTTTGATGGAATCGCTACAGGGCTCCTGGTGCTGAAGGACTTGGGCAT 2193
+3: SL*WNRYRAPGAEGLGH
+2: SLMESLQGSWC*RTWAF
+1: LFDGIATGLLVLKDLGI

DNA: TCAGGTGGACCGCTACATTGCCTCGGAGGTGTGTGAGGACTCCATCACGGT 2244
+3: SGGPLHCLGGV*GLHHG
+2: RWTATLPRRCVRTPSRW
+1: QVDRYIASEVCEDSITV

DNA: GGGCATGGTGCGGCACCAGGGGAAGATCATGTACGTCGGGGACGTCCGCAG 2295
+3: GHGAAPGEDHVRRGRPQ
+2: AWCGTRGRSCTSGTSAA
+1: GMVRHQGKIMYVGDVRS

DNA: CGTCACACAGAAGCATATCCAGGAGTGGGGCCCATTCGATCTGGTGATTGG 2346
+3: RHTEAYPGVGPIRSGDW
+2: SHRSISRSGAHSIW*LG
+1: VTQKHIQEWGPFDLVIG

DNA: GGGCAGTCCCTGCAATGACCTCTCCATCGTCAACCCTGCTCGCAAGGGCCT 2397
+3: GQSLQ*PLHRQPCSQGP
+2: AVPAMTSPSSTLLARAS
+1: GSPCNDLSIVNPARKGL
DNA: CTACGAGGGCACTGGCCGGCTCTTCTTTGAGTTCTACCGCCTCCTGCATGA 2448
+3: LRGHWPALL*VLPPPA*
+2: TRALAGSSLSSTASCMM
+1: YEGTGRLFFEFYRLLHD

DNA: TGCGCGGCCCAAGGAGGGAGATGATCGCCCCTTCTTCTGGCTCTTTGAGAA 2499
+3: CAAQGGR*SPLLLAL*E
+2: RGPRREMIAPSSGSLRM
+1: ARPKEGDDRPFFWLFEN

DNA: TGTGGTGGCCATGGGCGTTAGTGACAAGAGGGACATCTCGCGATTTCTCGA 2550
+3: CGGHGR**QEGHLAISR
+2: WWPWALVTRGTSRDFSS
+1: VVAMGVSDKRDISRFLE

DNA: GTCCAACCCTGTGATGATTGATGCCAAAGAAGTGTCAGCTGCACACAGGGC 2601
+3: VQPCDD*CQRSVSCTQG
+2: PTL**LMPKKCQLHTGP
+1: SNPVMIDAKEVSAAHRA

DNA: CCGCTACTTCTGGGGTAACCTTCCCGGTATGAACAGGCCGTTGGCATCCAC 2652
+3: PLLLG*PSRYEQAVGIH
+2: ATSGVTFPV*TGRWHPL
+1: RYFWGNLPGMNRPLAST

DNA: TGTGAATGATAAGCTGGAGCTGCAGGAGTGTCTGGAGCATGGCAGGATAGC 2703
+3: CE**AGAAGVSGAWQDS
+2: *MISWSCRSVWSMAG*P
+1: VNDKLELQECLEHGRIA

DNA: CAAGTTCAGCAAAGTGAGGACCATTACTACGAGGTCAAACTCCATAAAGCA 2754
+3: QVQQSEDHYYEVKLHKA
+2: SSAK*GPLLRGQTP*SR
+1: KFSKVRTITTRSNSIKQ

DNA: GGGCAAAGACCAGCATTTTCCTGTCTTCATGAATGAGAAAGAGGACATCTT 2805
+3: GQRPAFSCLHE*ERGHL
+2: AKTSIFLSS*MRKRTSY
+1: GKDQHFPVFMNEKEDIL

DNA: ATGGTGCACTGAAATGGAAAGGGTATTTGGTTTCCCAGTCCACTATACTGA 2856
+3: MVH*NGKGIWFPSPLY*
+2: GALKWKGYLVSQSTILT
+1: WCTEMERVFGFPVHYTD

DNA: CGTCTCCAACATGAGCCGCTTGGCGAGGCAGAGACTGCTGGGCCGGTCATG 2907
+3: RLQHEPLGEAETAGPVM
+2: SPT*AAWRGRDCWAGHG
+1: VSNMSRLARQRLLGRSW

DNA: GAGCGTGCCAGTCATCCGCCACCTCTTCGCTCCGCTGAAGGAGTATTTTGC 2958
+3: ERASHPPPLRSAEGVFC
+2: ACQSSATSSLR*RSILR
+1: SVPVIRHLFAPLKEYFA

DNA: GTGTGTGTAAGGGACATGGGGGCAAACTGAGGTAGCG 2995
+3: VCVRDMGAN*GS
+2: VCKGHGGKLR*
+1: CV*GTWGQTEVA
As DNA sequencing is performed more and more in a mass-production-like manner, efficient quality control
measures become increasingly important, but so also does the ability to compare different
methods and projects. One of the fundamental quality measures in sequencing projects is the position-specific
error probability at all bases in each individual sequence. Accurate prediction of base-specific error rates from
"raw" sequence data would allow immediate quality control as well as benchmarking different methods and
projects while avoiding the inefficiencies and time delays associated with resequencing and assessments after
"finishing" a sequence. The program PHRED provides base-specific quality scores that are logarithmically
related to error probabilities. This study assessed the accuracy of PHRED's error-rate prediction by analyzing
sequencing projects from six different large-scale sequencing laboratories. All projects used four-color
fluorescent sequencing, but the sequencing methods used varied widely between the different projects. The
results indicate that error-rate predictions such as those given by PHRED can be highly accurate for a large
variety of different sequencing methods as well as over a wide range of sequence quality.



of sequences can be very valuable. For example, different
large-scale sequencing projects may produce sequences at
similar rates and costs but with significantly different
error rates in the final sequence. One major determinant in
the final error rate is the accuracy of the "raw" sequence.
Knowledge about the frequency and location of errors in the
raw sequence data can help to direct "polishing" efforts to
the places where additional effort is needed; it also
enables the comparison between different sequencing
projects without requiring that the same region be
sequenced in each project.

Another area where estimates about sequence error rates
would be beneficial is technology development. Accurate
error estimates at each base would enable "quality
benchmarking" between different methods, thus enabling
researchers to choose the method that fits their needs for
accuracy and throughput.

E-MAIL peter.richterich@genomecorp.com; FAX (781) 89*-

Several groups have developed mathematical models to
predict the error probability at any given position in raw
sequences. Lawrence and Solovyev used linear discriminant
analysis to calculate separate probability estimates for
insertions, deletions, and mismatches (Lawrence and
Solovyev 1994). Ewing and Green (1998) developed the
program PHRED, which calculates a quality score at each
base. This quality score q is logarithmically linked to the
error probability p: q = -10 x log10(p) (for a discussion of
how quality scores are calculated and what the limitations
are, see Ewing et al. 1998). When used in combination with
sequence assembly and finishing programs that utilize these
error estimates, reliable error probabilities promise to
increase the accuracy of consensus sequences and to reduce
the efforts required in the finishing phase of sequencing
projects (Churchill and Waterman 1992; Bonfield and Staden
1995).
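
The logarithmic relation between quality score and error probability can be sketched in a few lines of Python. This is a minimal illustration of the formula q = -10 x log10(p) only, not PHRED's own code; the function names are chosen here for illustration.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Error probability p for a quality score q, from q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Quality score q for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Print the error probability for a few quality scores.
for q in (4, 10, 20, 30, 40):
    print(f"q = {q:2d}  ->  p = {phred_to_error_probability(q):.4%}")
```

Each step of 10 in the quality score lowers the error probability tenfold, so a score of 30 corresponds to an error probability of 0.1%.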

To examine the accuracy of probability estimates made by
the program PHRED, we compared the actual and predicted
error rates for six different cosmid- or BAC-sized projects
that were produced by six different large-scale sequencing
centers in the United States. All of these six projects
used four-color fluorescent sequencing machines; however,
the DNA preparation methods, sequencing enzymes,
fluorescent dyes and chemistries, and gel lengths varied
significantly between the six groups. Table 1 gives an
overview of the sequencing projects analyzed. Table 2 lists
the different methods used



RESULTS 

Error Rate Prediction Accuracy for Six Projects 

A comparison of actual and predicted error rates for 
the six projects in this study is shown in Table 3. 
Table 1. Summary of Data Sets

Project         Bases
A             416,214
B             871,230
C             603,655
D             414,595
E           1,149,209
F             907,796
Total       4,362,699

Average aligned length (total): 610

[The remaining entries of this table are only partly legible in this
copy. A further column reads 1277, ...111, 1638, 1885; the per-project
average aligned lengths read 915, 682, 567, 4?7, 482.]


The results indicate that PHRED is very successful in
identifying bases with low error probabilities. For
example, the 1.28 million bases with quality scores of 4-12
(corresponding to error probabilities between 39.8% and
6.3%) contain a total of 187,926 errors. In contrast, the
1.44 million bases with quality scores between 33 and 42
(corresponding to error probabilities between 0.05% and
0.006%) contain only 237 errors, which translates into a
790-fold lower error rate. The trend toward lower error
rates can also be observed for each individual project. In
most cases, the actual number of errors is close to the
predicted error rate. It is also apparent that the actual
error rate is typically lower than the predicted error
rate.
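
How the expected and actual error counts per quality-score range might be tallied can be sketched as follows. The sketch rests on two assumptions that are not spelled out in the text above: the expected number of errors for a set of bases is taken as the sum of the per-base error probabilities p = 10^(-q/10), and every aligned base is assumed to have been flagged already as correct or erroneous by comparison with the finished sequence. The function name and the toy data are hypothetical.

```python
# Quality-score ranges as used in the comparison of predicted and actual errors.
BINS = [(4, 12), (13, 22), (23, 32), (33, 42), (43, 51)]


def summarize_by_quality_bin(bases):
    """Tally aligned bases, expected errors and actual errors per score range.

    bases: iterable of (quality_score, is_error) pairs for aligned bases.
    """
    summary = {b: {"aligned": 0, "expected": 0.0, "actual": 0} for b in BINS}
    for q, is_error in bases:
        for low, high in BINS:
            if low <= q <= high:
                row = summary[(low, high)]
                row["aligned"] += 1
                row["expected"] += 10 ** (-q / 10)  # per-base error probability
                row["actual"] += int(is_error)
                break
    return summary


# Toy data: ten bases at q = 10 (one wrong) and 1000 bases at q = 30 (none wrong).
toy_bases = [(10, i == 0) for i in range(10)] + [(30, False)] * 1000
for (low, high), row in summarize_by_quality_bin(toy_bases).items():
    print(f"{low}-{high}: {row['aligned']} bases, "
          f"{row['expected']:.2f} expected, {row['actual']} actual errors")
```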

Both the high overall accuracy and the tendency to slightly
overpredict errors are confirmed by statistical analysis,
as shown in Table 4. The correlation between predicted and
actual error frequencies is excellent for all projects
(Spearman correlation coefficient >0.89, P < 0.0001).
Averaged over all projects, the actual error rate is 84.5%
of the predicted error rate. The slope of the relation
between predicted and actual error rates differs slightly
between projects and ranges from 76.6% to 88.4%. To put
these differences between projects in relation, it is
worthwhile remembering that PHRED quality scores cover a
wide dynamic range: The maximum quality score of 51
corresponds to a 50,000-fold lower predicted error rate
than the minimum quality score of 4. Even the relative
difference between successive quality scores is larger than
the relative difference in the slopes; for example, a
quality score of 10 corresponds to an error probability of
10%, whereas a score of 9 corresponds to an error
probability of 12.6%.
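
The Spearman coefficient is a rank correlation: it measures whether the observed error frequency rises and falls in step with the predicted one, regardless of the exact slope. The sketch below uses the textbook definition (Pearson correlation of average ranks). It is not the statistics package used in the study, the function names are hypothetical, and the numbers are toy values rather than data from Table 4.

```python
def average_ranks(values):
    """Rank values from 1 upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy example: predicted error frequency per quality score, and observed
# frequencies that are uniformly lower but perfectly ordered.
predicted = [10 ** (-q / 10) for q in range(4, 52)]
observed = [0.8 * p for p in predicted]
print(spearman(predicted, observed))
```

Because only the ordering matters, a uniform overprediction of errors leaves the rank correlation untouched; the size of the overprediction shows up in the slope instead.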

A different way of looking at the relation between the
actual and predicted error rates is shown in Figure 1.
Here, the error rates as a function of the position within
all reads from each of the projects, averaged over 50-base
windows, is depicted. For all six projects, the predicted
error rates are very close to the actual error rates over
the entire length of the sequences. Each project has a
characteristic distribution of error rates, which differs
from each of the other projects. The minimum error rate
differs dramatically between projects. The best projects
achieve raw error rates of 0.23%-0.36% in the best region
of the sequence read, typically from base 150 to 200. The
worst project in the data set had an ~10-fold higher error
rate of 2.58%.
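
Position-specific error rates of the kind plotted in Figure 1 could be computed along the following lines. This is a sketch under stated assumptions: the window size of 50 bases is taken as a default, positions are counted within the aligned part of each read, and each base carries a quality score plus a flag saying whether it was an actual error. The function name and the toy reads are hypothetical.

```python
def windowed_error_rates(reads, window=50):
    """Average predicted and actual error rates per position window.

    reads: list of reads, each a list of (quality_score, is_error) pairs
           in read order.
    Returns a list of (first_base, last_base, predicted_rate, actual_rate).
    """
    totals = {}
    for read in reads:
        for position, (q, is_error) in enumerate(read):
            w = position // window
            bases, predicted, actual = totals.get(w, (0, 0.0, 0))
            totals[w] = (bases + 1,
                         predicted + 10 ** (-q / 10),
                         actual + int(is_error))
    return [
        (w * window + 1, (w + 1) * window, predicted / bases, actual / bases)
        for w, (bases, predicted, actual) in sorted(totals.items())
    ]


# Toy data: a 100-base read without errors and a 50-base read with one error.
toy_reads = [
    [(20, False)] * 100,
    [(20, position == 0) for position in range(50)],
]
for first, last, predicted, actual in windowed_error_rates(toy_reads):
    print(f"bases {first}-{last}: predicted {predicted:.2%}, actual {actual:.2%}")
```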

Toward the end of sequence reads, the error rates increase
and start to exceed 10% between bases 400 and 700. In
projects that used mainly short gels (e.g., projects D and
F), this increase begins sooner, whereas projects that use
longer gels show a markedly longer stretch of low error
rates (e.g., projects A and B).

Table 5 summarizes key results for the six projects. The
first four projects have similar minimum and average error
rates. However, the length of the region where the error
rate is below 5% differs significantly, from 403 to 682
bases. The projects with the shorter low error rate regions
contained larger portions of reads generated on short gels,
whereas projects A and B were run exclusively on long gels
(ABI 373 stretch or ABI 377 sequencers). Other factors
contributing to differences between the first four projects
were differences in sequencing chemistries, production
scale, and electrophoresis conditions and machines.

Project E and, in particular, project F, had significantly
higher error rates than the first four projects. In
projects E and F, every sequence generated for the project
had been included in the data set, whereas the other four
projects had eliminated some "bad" sequences through manual
or auto-



Table 2. Overview of Sequencing Methods Used in the Different Projects

Template DNA             Single-stranded M13, double-stranded plasmids
Sequencing enzymes       Sequenase, Taq, KlenTaqTR, AmpliTaq FS
Sequencing chemistries   Dye primer (two different dye chemistries),
                         dye terminator
Sequencing machines      ABI 373, ABI 373 stretch, ABI 377
Gel length               Only short gels, only long gels, mixes of
                         short and long gels
Table 3. Comparison of Predicted and Actual Error Rates for Six Different
Sequencing Projects

                                          Quality score
Project                      4-12      13-22      23-32      33-42      43-51

A    aligned bases        119,246        [?]        [?]        [?]  1,234 (?)
     expected errors       20,256      2,064        172     37 (?)        [?]
     actual errors         16,784      1,758        [?]        [?]      1 (?)

B    aligned bases        182,034    137,940    181,998    3?9,690    140,176
     expected errors       29,953      3,704        410        102        [?]
     actual errors         26,038      2,536        287        [?]      0 (?)

C    aligned bases        139,345    131,419        [?]        [?]     69,529
     expected errors       22,277      3,411        357         74          2
     actual errors         16,670      1,513        194         26          3

D    aligned bases        103,898     68,995     68,613    153,730    111,752
     expected errors       16,880      1,919        168         38          3
     actual errors         14,495      1,924        146         59          2

E    aligned bases        378,755    217,438    167,968    392,717    144,313
     expected errors       63,947      6,336        418         95          4
     actual errors         55,968      6,516        355         67          5

F    aligned bases        359,809    136,688     98,840     64,035      5,130
     expected errors       66,938      4,079        256         23          0
     actual errors         57,971      3,856        332         33          1

All  aligned bases      1,283,087    767,773    739,007  1,447,118    343,134
     expected errors      220,252     21,513      1,781        370         13
     actual errors        187,926     18,103        [?]        [?]         12

[?] = not legible in this copy; (?) = reading or row assignment uncertain.




matic inspection. After eliminating <10% of the worst
sequences in project E, the error rates for the remaining
sequences were comparable to those of the first four
projects. In contrast, project F showed a much more uniform
distribution of sequence quality.



The last column in Table 5 shows the average number of
bases with an estimated error probability of at most 0.1%,
which is equivalent to a quality score of at least 30. The
count of such "very high-quality" bases is a good indicator
of sequence quality, both for individual sequences and,
when aver-


Table 4. Summary of Statistical Analysis Results

Project    Spearman    P          Slope    t ratio    P

A          0.9646      <0.0001    0.818      75.1     <0.0001
B          0.9890      <0.0001    0.874      98.2     <0.0001
C          0.9846      <0.0001    0.766      71.6     <0.0001
D (a)      0.8692      <0.0001    0.855      68.3     <0.0001
E          0.9[?]6     <0.0001    0.884     144.3     <0.0001
F          0.9968      <0.0001    0.865     151.6     <0.0001
All        0.9964      <0.0001    0.845     174.5     <0.0001

(a) For project D, the Spearman correlation coefficient was artificially low as only very few bases (10) had
a quality score of 5, and none of these bases contained an actual error (expected: 3.16 errors). Exclusion of
this quality score gave a Spearman correlation coefficient of 0.9786 (P < 0.0001). The frequencies in the slope
calculation were weighted by the number of bases at any given quality score and, thus, were not sensitive to
such small sample distortions (see Methods).
Table 5

           Actual       Actual       Length of    Length of    Average bases
           minimum      average      <1% error    <5% error    with P(error)
Project    error        error        region       region       <0.1%
           rate (%)     rate (%)

A          0.36         3.6          422          682          468
B          0.34         2.4          274          567          395
C          0.23         3.1          291          479          348
D          0.39         4.7          300          403          294
E          0.71         9.2 (?)      129 (?)      464          317
F          2.58         [?]          [?]          162           79

[?] = not legible in this copy; (?) = reading or row assignment uncertain.

ity analysis and control in large-scale DNA sequencing
projects. To analyze how accurate PHRED error



the same sequencing project, we subdivided a data set into
four quartiles, based on the number of very high-quality
bases in each sequence (see Methods). The comparison of
actual and predicted error rates is shown in Figure 2.
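
The quartile split described above can be sketched as follows. The threshold of 30 follows from the statement that an error probability of at most 0.1% is equivalent to a quality score of at least 30; how ties and uneven group sizes were handled in the study is not stated, so the choices below (stable sort, integer boundaries) are assumptions, as are the function names.

```python
def count_very_high_quality(qualities, threshold=30):
    """Number of bases with a quality score of at least `threshold`."""
    return sum(1 for q in qualities if q >= threshold)


def split_into_quartiles(reads, threshold=30):
    """Sort reads by their count of very high-quality bases and cut into four.

    reads: dict mapping a read name to its list of quality scores.
    Returns four lists of read names; quartile 1 (most very high-quality
    bases) comes first.
    """
    ranked = sorted(
        reads,
        key=lambda name: count_very_high_quality(reads[name], threshold),
        reverse=True,
    )
    n = len(ranked)
    return [ranked[(i * n) // 4:((i + 1) * n) // 4] for i in range(4)]


# Toy data: eight reads of 8 bases with 0 to 7 very high-quality bases each.
toy_reads = {f"read{k}": [30] * k + [10] * (8 - k) for k in range(8)}
for number, names in enumerate(split_into_quartiles(toy_reads), start=1):
    print(f"quartile {number}: {names}")
```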

When measured by the error rate in the best region of a
sequence, the data quality in the different quartiles
varies >100-fold between the best and the worst 25% of the
sequences. The best quartile showed ~0.03% error for >100
bases, whereas the error rate in the worst quartile always
exceeded 5%. For quartiles 2 and 3, the predicted error
rates match the actual error rates very closely. In the
best and
Figure 2 Actual and predicted error rates in different quality subsets of project
B. Sequence reads were sorted by the number of bases with a predicted error rate
of at most 0.1% (very high-quality bases) [...]
worst quartiles, PHRED's accuracy was somewhat lower from
base 100 to 500. In the best sequences, PHRED's error
estimates were about twofold too high; in the worst
sequences, the error estimates were too low, again by a
factor of 2. This underprediction of errors can be
explained by the fact that PHRED gives ambiguous base calls
(N's) a quality score of 4, corresponding to an error
probability of 39.8%; however, N's will always show up as
an actual error. Even in the worst and best quartiles,
however, the predicted error rate curves are very similar
to the actual error rate curves.

The results shown in Figure 2 also demonstrate that the
count of very high-quality bases, or bases with an
estimated error probability of at most 0.1%, can be used
effectively to characterize the overall quality of a
sequence read. Sorting the sequence reads into quartiles
based on the number of very high-quality bases worked well,
as shown by the >100-fold difference in the minimum error
rate between the first and the fourth quartile.
Other methods to characterize the overall quality of
individual reads based on PHRED quality scores can give
similar results. For example, counting bases above a
minimum quality threshold anywhere in the range of 20-40
gave similar results for most data sets (not shown), and
such counts are used by a number of different laboratories
as quality measures. Alternatively, the quality values can
be converted to error probabilities and averaged to give
the predicted error rate for the trace, or summed to give
the total predicted number of errors in a trace. However,
such averages and totals can sometimes give a misleading
picture, as the following example illustrates. Assume that
two sequence reads have very similar quality in the
alignable part of the read but that one of the two
sequences was run much longer and

[...] with quartile 1 corresponding to the highest numbers. Actual and predicted
error rates for all sequences in each subset were calculated as in Fig. 1. Note
that a number of sequence reads that had been rejected because of too low quality
were added back to the data set for illustrative purposes, all of which are in
quartile 4. These [...] used to generate Figs. 1 and 3 and [...]
sionally lead to questionable conclusions, as the results
shown in Figure 3 illustrate.

Figure 3 shows the total actual error rates and the
frameshift error rates for two projects, A and B. The total
error rates for both projects are similar for up to 350
bases; after 350 bases, project B has a somewhat higher
total error rate. However, examining the frameshift error
rate gives rise to a different picture: from base 1 to 500,
project A has approximately four times as many insertions
and deletions as project B. This difference in frameshift
error rates can be explained by the sequencing chemistries
that were used in the two projects, [...] projects; project
B, with the lower frameshift error rate, used only dye
terminator chemistry, which is known to eliminate band
spacing artifacts

Figure 3 Actual frameshift and total error rates for projects A and B. To
calculate frameshift error rates, only insertions and deletions were counted.
Mismatch errors, which account for the vast majority of errors after base 150,
were included for the total error count. Note that project B has a similar or
slightly higher error rate compared to A but only about [...] insertions and
deletions up to base 500. For both projects, the [...] for >300 bases and [...]
1 in 10,000 for >100 bases in project B.


therefore contains a loiter uriallgnable "tall" of 
very low^quality bases When calculating the aver- 
age error rate for these two sequence*, the second 
sequence will have a much higher average error and, 
therefore, appear to he of lower quality, in contrast, 
the couaty.uf very htgh-qiiafity bases for both se 
qiienees Wilt be very siniiiar; as the UTialignable tails 
c^tajii few* if high-quality bases.; Therefore, 
couT)t^0f bas^s ^oye a high chough Quality thresh- 
dld wdl give a more robust and clearer picture of 
trice quality . 

Frafheshjft Error Rates for Different Sequencing 
Chemistries 



— : W$m on how biologists use DNA sequences, 
knowledge about total error rates in raw sequences 
m$y m -imy pqf M suffiicKnt, For example, frame^ 
shift enors m coding sequences will generally lead 
to incorrectly predicted open reading frame, 
whereas mismatch criers will do so only if the mis- 
match introduce* a stop codon or a new splice site. 
At the time of this writing/ PHRED did not differen- 
tiate between mismatch and frameshift errors, but 
only estimated total error rotes. This might occa- 



from hairpin sttuctures ("com- 
pressions^); Project A, on the 
other hand, used dye primer chemistry, which is 
more prone to insertion and deletion errors from 
mobility artifacts, for most sequencing reactions. 
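
The distinction between total and frameshift error rates can be illustrated with a minimal Python sketch. It assumes a gapped pairwise alignment in which "-" marks a gap; that convention, the function names, and the example strings are hypothetical and are not taken from the CROSS_MATCH output format used in the study.

```python
def classify_errors(read_aln, ref_aln):
    """Count mismatches, insertions, and deletions in a gapped alignment.

    Assumption: both strings have equal length and '-' marks a gap.
    """
    if len(read_aln) != len(ref_aln):
        raise ValueError("aligned strings must have the same length")
    counts = {"mismatch": 0, "insertion": 0, "deletion": 0}
    for read_base, ref_base in zip(read_aln, ref_aln):
        if read_base == "-" and ref_base == "-":
            continue  # column carries no information
        if ref_base == "-":
            counts["insertion"] += 1  # extra base in the read
        elif read_base == "-":
            counts["deletion"] += 1   # base missing from the read
        elif read_base != ref_base:
            counts["mismatch"] += 1
    return counts


def error_rates(counts, aligned_ref_bases):
    """Return frameshift (insertion + deletion) and total error rates."""
    frameshift = counts["insertion"] + counts["deletion"]
    total = frameshift + counts["mismatch"]
    return {
        "frameshift_rate": frameshift / aligned_ref_bases,
        "total_rate": total / aligned_ref_bases,
    }


# Hypothetical alignment: one deletion, one insertion, one mismatch.
read = "ACG-AACGTTTACC"
ref = "ACGTAACG-TTACG"
counts = classify_errors(read, ref)
rates = error_rates(counts, aligned_ref_bases=len(ref.replace("-", "")))
print(counts, rates)
```

Two reads with the same total rate can differ widely in the frameshift rate, which is the situation described for projects A and B.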

DISCUSSION 

As large-scale DNA sequencing has become a more routine and common process, the traditional methods for assessing sequence quality have become unsatisfactory. In projects like single-pass cDNA sequencing, it is not possible to calculate and compare error rates after finishing a sequence, as finishing never takes place. Even when a comparison between raw and finished sequence can be done, the time delay between raw data generation and quality assessment is often large. This delay makes it difficult to improve ongoing projects, and it sometimes makes it impossible to capture problems early on. Some immediate quality feedback can be reached by including known standard sequences for quality control. However, this approach can be costly, and it fails when error profiles differ between standard and unknown sequences.

In contrast to these traditional methods to assess sequence accuracy, direct estimation of error [...] quality control and feedback. Accurate, base-by-base estimates of error probabilities could also increase the utility of single-pass sequences significantly, allow efficient comparison and optimization of different sequence chemistries, and enable the development of better software tools for sequence [...]

The critical question for any error rate prediction tool is how accurate the error rate estimates are, in particular if different sequencing methods and chemistries are used. The results presented herein provide an answer to this question for the program PHRED, as well as clues where further development would be useful. As shown in Tables 3 and 4 and in Figure 1, the agreement between predicted and actual error rates was very good in each of the six different projects analyzed. The observed high level of prediction accuracy in all of these projects is almost astonishing if one takes into account that actual errors are binary (a base is either correct or wrong), whereas predicted error rates are probabilities on a scale from 0.0 to 1.0. The observed tendency to overpredict error rates can be at least partially explained by the "small sample correction" that was used in the derivation of threshold parameters for quality scores (Ewing and Green 1998). For most practical applications, such a somewhat conservative estimation of quality scores is tolerable or even desirable. Overall, the results clearly show that error probabilities given by PHRED accurately describe raw sequence data quality.

In judging the usefulness of predicted error probabilities, it is important to know how differences in sequencing methods will influence the prediction accuracy. For example, the variation in peak heights tends to be larger in dye terminator sequencing than in dye primer sequencing, and different sequencing enzymes are known to produce different specific height variation patterns. Any estimation of error probabilities that takes the peculiarities of a specific sequencing chemistry into account would therefore be expected to be less accurate for different chemistries.

The projects included in this study were specifically chosen to provide an initial answer to the question of how generally useful PHRED quality scores are. These projects represent the vast majority of different multicolor fluorescent sequencing methods used in the last 3 years: different template DNAs and DNA preparation methods, different enzymes, gel lengths, run conditions, and different fluorescent dyes. The data also include a considerable spread in data quality, both between projects and within individual projects. None of the projects analyzed here were included in PHRED's training set, and just one of the six laboratories that contributed data to this study also contributed data to the training data sets. One of the projects in this study consisted entirely of dye terminator sequences, which represented only a small fraction of the sequences in the test data set. Another project exclusively used a set of fluorescent dyes different from those used in the training sets. Each project differed from the other projects in this study in at least one, and typically many, experimental aspects like template preparation, sequencing enzymes, gel run conditions, and so forth. Despite these differences, the accuracy of error rate predictions was very similar for all projects.

Our results justify some optimism about the accuracy of PHRED quality scores for minor changes in sequencing technology, for example, sequences generated by new enzymes and fluorescent dyes. Initial studies showed that PHRED quality scores were also accurate for sequences produced by multiplex sequencing with radioactive detection (P. Richterich, unpubl.). However, we also observed two effects that can invalidate PHRED quality scores during these studies. First, sequences generated by chemical sequencing gave too low quality scores at mixed (A + G) reactions. Because secondary peak height is one of the parameters used in the error rate predictions, this is not surprising. Another potential source of error is high-frequency noise in the trace data. With such data, PHRED occasionally underestimated the band spacing by a factor of 2 or more, which resulted in incorrect base calls and quality scores. By applying simple smoothing algorithms to data with high-frequency noise, these problems could typically be resolved. Similar steps may be necessary to obtain accurate PHRED quality scores on data that have been generated by different sequencing instruments or preprocessed by different software.
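
The text does not say which smoothing algorithms were applied. As one plausible illustration only, a centered moving average is among the simplest ways to damp high-frequency noise in a trace channel; the function name, window size, and sample values below are assumptions made for this sketch.

```python
def moving_average(signal, window=5):
    """Smooth a trace channel with a centered moving average.

    Assumption: 'window' is a positive odd number of samples; the window
    is truncated at both ends of the signal.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical noisy samples alternating around a level of 5.
noisy = [0, 10, 0, 10, 0]
print(moving_average(noisy, window=3))
```

A wider window removes more noise but also flattens genuine peaks, so the window would have to stay well below the expected band spacing.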

Accurate quality scores can have a major impact on how sequences are used downstream from the sequence production process. In traditional sequencing projects where the goal is complete coverage at a final error rate below (e.g.) 1 in 10,000, the accuracy goals can be reached with single sequence reads as long as the quality scores are at least 40 (however, other potential problems like clone instability may make higher coverage advisable). Interesting questions arise as to how individual read quality contributes to project quality, or the error rate of the "final" sequence. Under the assumption that errors between different sequence reads are completely independent, one could argue that two reads with a quality score of 20 (error probability of 1 in 100) are just as valuable as one sequence with a quality score of 40 (error probability of 1 in 10,000). However, although a single sequence stretch with quality levels above 40 would give a final sequence with an error rate of <1 in 10,000, assembly of a consensus from two sequences with quality scores of 20 (1% error rate) could lead to one of two results: if the errors were completely random, the consensus sequence would be ambiguous at 2% of all locations; if the errors were completely localized, for example, because of reproducible compressions, the consensus sequence would have one "hidden" error every 100 bases. Typically, consensus sequences derived from low-quality sequences will have both kinds of problematic regions. Increased coverage can rapidly eliminate the random errors; however, increased coverage does not resolve errors from systematic sources. Manual examination of such problem areas is generally required; such "contig editing," however, tends to be time consuming, requires highly trained personnel, is an obstacle toward complete automation of DNA sequencing, and sometimes fails to eliminate all errors. This leads to the somewhat counterintuitive conclusion that the practical value of increasing sequence quality can be even higher than indicated by the quality scores: One sequence of average quality above 40 may be worth more than two sequences of average [...]
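
The arithmetic behind the comparison of one quality-40 read with two quality-20 reads can be checked with a short Python sketch. It uses the relation between quality score and error probability given in the text (quality 20 corresponds to 1 in 100, quality 40 to 1 in 10,000); the variable names are illustrative, and the two scenarios are the idealized extremes described above, not a model of real assemblies.

```python
def error_probability(q):
    """Error probability for a quality score q: p = 10^(-q/10)."""
    return 10 ** (-q / 10)


p20 = error_probability(20)  # about 1 in 100
p40 = error_probability(40)  # about 1 in 10,000

# Independent errors: both reads are wrong at the same position with
# probability p20 * p20, which matches a single quality-40 read.
both_wrong = p20 * p20

# Completely random errors: the two reads disagree (ambiguous consensus)
# wherever exactly one of them is wrong, at roughly 2% of positions.
ambiguous = 2 * p20 * (1 - p20)

# Completely localized errors: both reads share the same error, so the
# consensus carries a hidden error at the single-read rate of 1 in 100.
hidden = p20

print(p20, p40, both_wrong, ambiguous, hidden)
```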

Another application of DNA sequencing where high quality can be of disproportionately high value is the search for mutations in genomic DNA. In low-quality sequences, secondary peaks and low resolution often complicate the identification of heterozygous mutations. In regions of higher sequence quality, such secondary peaks are smaller or absent, and peaks are better resolved. Therefore, both false-positive and false-negative errors can be significantly reduced in high-quality regions. Tools like PHRED, which can accurately measure sequence quality from trace data, can be of twofold value for mutation detection. First, base-specific quality scores can allow optimization of sequencing methods and strategies for mutation detection. Second, quality scores can be used to evaluate the usefulness of individual sequence reads for mutation detection [...] thresholds), and they can guide software that automatically detects mutations.

The ability to predict error rates in a highly accurate fashion is likely to have a major impact in applications like those described above. PHRED is the first widely used program that accurately [...] algorithm for determining quality values has been described (Ewing and Green 1998), and it should be straightforward to implement similar quality values in other base-calling programs. Furthermore, an extension of the approach developed by Ewing and Green should be possible. For example, differentiation between mismatch and frameshift errors would enable better comparisons of sequencing methods with similar total error rates but different frameshift error rates. Several groups have described efforts to calculate separate probabilities (or "confidence assessments") for mismatch errors and frameshift errors (Lawrence and Solovyev 1994; Berno 1996). Their results demonstrated that different approaches to error type characterization are feasible and promising. Implementation of such error type predictions in other programs similar to the way PHRED uses quality scores would enable better method assessments, benchmarking, and production quality control, and could have a significant impact on downstream uses of DNA sequence information.

METHODS

For one project, sequence raw data in the form of ABI trace files were downloaded from a public FTP site. Sequence data for the five other projects were kindly provided by five different large-scale sequencing groups. Table 1 gives a summary of the six projects, and Table 2 gives an overview of the different sequencing methods used in the projects. The projects differed in the amount of prescreening of data that had been done, reflecting different approaches to quality control in different laboratories. In two projects (B and C), different software programs had been used to identify and eliminate low-quality sequences. One project (F) included all data files generated, whereas the other three projects had excluded "failed lanes."

Comparison of Actual and Predicted Error Rates 

The sequences for all traces in each project were recalled using the program PHRED (v. 961028). Next, sequences in each project were assembled with PHRAP (P. Green, unpubl.). Slightly different methods were chosen for the statistical and graphical evaluation of the error rate prediction accuracy. In the statistical evaluation, only the longest contig produced by PHRAP was considered. The tables of aligned bases and observed discrepancy counts for each quality score were taken from the PHRAP output and analyzed as follows. The expected number of discrepancies (E) at each quality score (q) was calculated by multiplying the number of aligned bases (N) with the error probability corresponding to the quality score: E = N × 10^(-q/10). The Spearman ranking coefficients were calculated by comparing the expected and observed error frequencies. To obtain the quantitative relation between the expected and observed error rates over the entire range, a least-squares fit between the observed and expected rates was performed, with the intercept set to zero and the number of aligned bases at each quality score used as weights.
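
The statistical evaluation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the analysis code used in the study: the table of quality scores, aligned bases, and observed discrepancies is invented for the example, and the helper names are hypothetical.

```python
import math


def expected_discrepancies(n_aligned, q):
    """E = N * 10^(-q/10) for N aligned bases at quality score q."""
    return n_aligned * 10 ** (-q / 10)


def weighted_slope_through_origin(x, y, w):
    """Weighted least-squares slope for y = b * x (intercept fixed at zero)."""
    num = sum(wi * xi * yi for xi, yi, wi in zip(x, y, w))
    den = sum(wi * xi * xi for xi, wi in zip(x, w))
    return num / den


def ranks(values):
    """Ranks starting at 1; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


# Hypothetical table: (quality score, aligned bases, observed discrepancies).
table = [(10, 5000, 450), (20, 20000, 180), (30, 60000, 50), (40, 100000, 8)]
expected_rate = [10 ** (-q / 10) for q, _, _ in table]
observed_rate = [obs / n for _, n, obs in table]
weights = [n for _, n, _ in table]

slope = weighted_slope_through_origin(expected_rate, observed_rate, weights)
rho = spearman(expected_rate, observed_rate)
print(slope, rho)
```

With these invented counts the slope comes out below 1, meaning fewer errors were observed than predicted, the same direction as the overprediction reported in the Discussion.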

For a graphical comparison of estimated and actual error rates in 50-bp windows, the following steps were taken. For two of the projects, the consensus [...] bases. For the four other projects, the DNA sequence and quality information were used by the program [...] the projects, the individual reads were aligned to the [...] using the program CROSS_MATCH (P. Green, unpubl.), after removing single-coverage regions from the ends of the consensus sequence. CROSS_MATCH uses an implementation of the Smith [...] typically do not include the ends of sequences, where disagreements are commonly due to vector sequence or low-quality sequence.

The quality files generated by PHRED and the alignment summaries generated by CROSS_MATCH were then analyzed as follows. First, the [...] by CROSS_MATCH was determined. Next, the actual and predicted [...] each individual sequence was calculated; in addition, the average actual and predicted error rates for all alignable [...] bases. To calculate the predicted error rate, the quality scores determined by PHRED at each base were converted to error probabilities [...] 1998).
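
The conversion of per-base quality scores to predicted error rates in windows can be sketched in Python as follows. The sketch assumes the relation p = 10^(-q/10) stated earlier in this section; the window handling, function names, and example quality values are illustrative assumptions, not details recovered from the source.

```python
def quality_to_error_prob(q):
    """Convert a quality score to an error probability: p = 10^(-q/10)."""
    return 10 ** (-q / 10)


def windowed_predicted_error(quals, window=50):
    """Average predicted error rate in consecutive windows of bases.

    Assumption: the final window may be shorter than 'window'.
    """
    rates = []
    for start in range(0, len(quals), window):
        chunk = quals[start:start + window]
        probs = [quality_to_error_prob(q) for q in chunk]
        rates.append(sum(probs) / len(probs))
    return rates


def predicted_error_count(quals):
    """Total predicted number of errors in a read (sum of probabilities)."""
    return sum(quality_to_error_prob(q) for q in quals)


# Hypothetical read: quality drops from 30 to 20 to 10 toward the end.
quals = [30] * 50 + [20] * 50 + [10] * 25
print(windowed_predicted_error(quals))
print(predicted_error_count(quals))
```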



Subdividing Data into Subsets Based on Data Quality

To examine the accuracy of PHRED quality scores for data subsets of different quality within a project, the following approach was taken. For all sequence reads in project B, the number of bases with a quality score of at [...] (bases with quality scores of at least 30 were called very high-quality bases, or VHQ bases). Sequences were sorted in descending order based on the number of very high-quality bases, [...] quartile 1 contained 25% of sequences with the highest number [...] accuracy in data with relatively high error rates, sequences from project B that had been "discarded" because they had not met the minimum quality criteria [...] in each quartile were compared to the consensus sequences that had been generated using the entire data set, as described above for the graphical comparison.
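
The quartile assignment by VHQ base count can be sketched in Python as shown below. The threshold of 30 comes from the text; the function names, the read data, and the handling of read counts that are not divisible by four are assumptions made for illustration. The last lines also show why counts can be more robust than average error rates when one read carries a long low-quality tail.

```python
VHQ_THRESHOLD = 30  # quality score of at least 30, as defined in the text


def count_vhq(quals, threshold=VHQ_THRESHOLD):
    """Number of very high-quality (VHQ) bases in a read."""
    return sum(1 for q in quals if q >= threshold)


def assign_quartiles(reads):
    """Map read name to quartile (1 = most VHQ bases, 4 = fewest).

    'reads' is a dict of read name to list of quality scores (hypothetical
    input format).
    """
    ordered = sorted(reads, key=lambda name: count_vhq(reads[name]), reverse=True)
    n = len(ordered)
    return {name: idx * 4 // n + 1 for idx, name in enumerate(ordered)}


def mean_error(quals):
    """Average predicted error rate of a read."""
    return sum(10 ** (-q / 10) for q in quals) / len(quals)


# Hypothetical reads with decreasing numbers of VHQ bases.
reads = {f"read{i}": [40] * (400 - 50 * i) + [10] * 100 for i in range(8)}
print(assign_quartiles(reads))

# Two reads with identical alignable parts; read_b has a much longer tail.
read_a = [40] * 400 + [5] * 50
read_b = [40] * 400 + [5] * 400
print(count_vhq(read_a), count_vhq(read_b))    # identical VHQ counts
print(mean_error(read_a), mean_error(read_b))  # read_b looks far worse
```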

Determining Actual Frameshift Error Rates

The calculation of actual frameshift error rates in the raw sequence data was performed using CROSS_MATCH, similar to the procedure described above for total error rates, except that only insertion and deletion errors were counted. Because PHRED does not give separate frameshift error estimates, a comparison [...] errors is [...]



[...] for contributing their data, Dr. Josee Dupuis for help with the statistical analysis, and to Phil Green for helpful discussions.

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.

REFERENCES

Berno, A.J. 1996. A graph theoretic approach to the analysis of DNA sequencing data. Genome Res. 6: 80-91.

Bonfield, J.K. and R. Staden. 1995. The application of numerical estimates of base calling accuracy to DNA sequencing projects. Nucleic Acids Res. 23: 1406-1410.

Churchill, G.A. and M.S. Waterman. 1992. The accuracy of DNA sequences: Estimating sequence quality. Genomics [...]

Ewing, B. and P. Green. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. (this issue).

Ewing, B., L. Hillier, M.C. Wendl, and P. Green. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. (this issue).

Lawrence, C.B. and V.V. Solovyev. 1994. Assignment of position-specific error probability to primary sequence data. Nucleic Acids Res. 22: 1272-1280.



Received October 27, 1997; accepted in revised form February 11, 1998.

GENOME RESEARCH 259






